Abstract


Classifiers for the semi-supervised setting often combine strong supervised models with additional
learning objectives to make use of unlabeled data. This results in powerful though
very complex models that are hard to train and that demand additional labels for optimal parameter
tuning, which are often not given when labeled data is very sparse. We here study
a minimalistic multi-layer generative neural network for semi-supervised learning in a form 
and setting as similar to standard discriminative networks as possible. Based on normalized 
Poisson mixtures, we derive compact and local learning and neural activation rules. Learning 
and inference in the network can be scaled using standard deep learning tools for parallelized 
GPU implementation. With the single objective of likelihood optimization, both labeled and 
unlabeled data are naturally incorporated into learning. Empirical evaluations on standard 
benchmarks show that, for datasets with few labels, the derived minimalistic network improves
on all classical deep learning approaches and is competitive with their recent variants without
the need of additional labels for parameter tuning. Furthermore, we find that the studied network
is the best performing monolithic (‘non-hybrid’) system for few labels, and that it can be
applied in the limit of very few labels, where no other system has been reported to operate so 
far. 


1 Introduction 

Deep neural networks (DNNs) have demonstrated state-of-the-art performance in many application 
domains. If large labeled databases and large computational resources are available, discriminative deep 
networks are now among the best performing systems in tasks such as image or speech recognition, 
document classification and many more [for example, Schmidhuber, 2015, Bengio et al., 2013, Hinton
et al., 2012].

If no labels are available, unsupervised approaches are the method of choice, and those based on deep 
directed graphical models are well suited to capture the rich structure of typical data such as images or 
speech. However, while being potentially more powerful information processors than discriminative systems,
such directed models are typically trained on much smaller scales (either because of computational
limits or performance saturation). For instance, deep sigmoid belief networks [SBNs, Saul et al., 1996,
Gan et al., 2015] or newer models such as NADE [Larochelle and Murray, 2011] have only been trained
with a couple of hundred to about a thousand hidden units [Bornschein and Bengio, 2015, Gan et al.,
2015].

For settings of partly labeled training data, supervised and unsupervised approaches come together. These
semi-supervised settings are increasingly interesting both for technical and practical reasons: While
obtaining large amounts of data (like images or sounds) is often relatively easy, the effort to obtain labels
is comparatively high, for example, if manual labeling of the data is required. Data sets with few labels
therefore emerge as a natural application domain, and such settings have consequently shifted into the
focus of many recent contributions [Liu et al., 2010, Weston et al., 2012, Pitelis et al., 2014, Kingma et al.,
2014, Rasmus et al., 2015, Miyato et al., 2015].

The most successful contributions in this semi-supervised setting so far have been hybrid combinations of 
two or more learning algorithms, which merge unsupervised and supervised learning [see, for example, 
Weston et al., 2012, Kingma et al., 2014, Rasmus et al., 2015, Miyato et al., 2015]. However, while
deep neural networks alone are often already equipped with many tunable parameters (for architecture, 
regularization, sparsity etc.), such hybrid approaches add further parameters for the interplay between 
supervised and unsupervised learning. This makes their practical application to settings where only few
labels are available difficult: In principle, those models are able to train on very small amounts of labeled
data with state-of-the-art results. However, to find suitable settings of tunable parameters for such complex
models, generally many more labels are needed than are available during training in order to avoid the risk of
strongly overfitting to a very small validation set. Consequently, similar performance has never been shown
when not only the number of training labels, but the total number of labels (that is, during training and
tuning) was highly restricted.

This work investigates the semi-supervised setting with a minimalistic, deep directed graphical model, 
which can be formulated as a neural network. The objective of likelihood optimization given by the 
graphical model directly combines information of unlabeled and labeled data in a monolithic learning 
system. With only a handful of resulting free parameters, tuning can be done even in settings where
labeled data is extremely sparse. Furthermore, the similarity to standard neural networks enables the
application of software tools for parallelized learning on GPUs [like Bastien et al., 2012]. This allows us to
scale the generative network to tens of thousands of hidden units (we here show networks with up to 20 000
hidden units) and to apply it to large data sets (here, up to 400 000 samples). Finally, the use of local
and compact inference and learning rules closely links the network to recent approaches for bio-inspired
computer hardware, such as VLSI [especially Neftci et al., 2015, Diehl and Cook, 2015, Nessler et al.,
2013].


2 A Hierarchical Mixture Model for Classification 


A classification problem can be modeled as an inference task based on a probabilistic mixture model. 
Such a model can be hierarchical, or deep, if we expect the data to obey a hierarchical structure. For 
hand-written digits, for instance, we first assume the data to be divided into digit classes (‘0’ to ‘9’) and 
within each class, we expect structure that distinguishes between different writing styles. Most deep 
systems allow for a much deeper substructure, using five, ten, or recently even up to 100 or 1000 layers
[He et al., 2015]. For our goal of semi-supervised learning with few labels, we however want to restrict
the model complexity to the necessary minimum of a hierarchical model.


2.1 The Generative Model 

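In accordance with the hierarchical formulation of a classification problem, we
define the minimalistic hierarchical generative model shown in fig. 1 as follows:

$$p(k) = \frac{1}{K}, \qquad p(l|k) = \delta_{lk} \tag{1}$$

$$p(c|k,\mathcal{R}) = \mathcal{R}_{kc}, \qquad \sum_{c} \mathcal{R}_{kc} = 1 \tag{2}$$

$$p(y|c,\mathcal{W}) = \prod_{d} \mathrm{Poisson}(y_d; \mathcal{W}_{cd}), \qquad \sum_{d} \mathcal{W}_{cd} = A \tag{3}$$

The parameters of the model, $\mathcal{W} \in \mathbb{R}^{C \times D}$ and $\mathcal{R} \in \mathbb{R}^{K \times C}$, will be referred to
as generative weights, which are normalized to constants $A$ and 1, respectively.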

The top node (see fig. 1) represents $K$ abstract concepts or super classes $k$ with
labels $l$ (for example, digits ‘0’ to ‘9’). The middle node represents any of the
occurring $C$ subclasses $c$ (like different writing styles of the digits). The
bottom nodes represent an observed data sample $y$, which is generated by the
model according to a Poisson distribution, and the data label $l$, which is given by a Kronecker delta, that
is, without label noise. Here, we assume non-negative observed data and use the Poisson distribution
as an elementary distribution for such data (compare restricted Boltzmann machines or sigmoid-belief- 
networks). While a Poisson distribution is a natural choice for non-negative data, it also turns out to be 
mathematically convenient for the derivation of our inference and learning rules. 

Note for the model in eqs. (1) to (3) that, while the normalization of the rows of $\mathcal{R}$ is required for normalized
categorical distributions, the normalization of the rows of $\mathcal{W}$ represents an additional assumption of
our approach. By constraining the weights to sum to a constant $A$, the model expects contrast normalized
data. If the dimensionality $D$ of the observed data is sufficiently large, we can simply normalize the data
such that $\sum_d y_d = A$ in order to fulfill this constraint with high accuracy. Denoting the unnormalized
data points by $\tilde{y}$, we here assume the normalized data points $y$ to be obtained as follows:

$$y_d = (A - D)\, \frac{\tilde{y}_d}{\sum_{d'} \tilde{y}_{d'}} + 1. \tag{4}$$

Figure 1: Graphical illustration of the hierarchical generative model.
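As a worked instance of eq. (4), the following minimal sketch normalizes a single data point in plain Python. The function name `normalize` and the toy values are illustrative assumptions and not part of the original implementation; the check `A > D` reflects the constraint stated in sec. 3.1.

```python
def normalize(y_raw, A):
    """Contrast normalization of eq. (4): y_d = (A - D) * y_raw_d / sum(y_raw) + 1."""
    D = len(y_raw)
    total = sum(y_raw)
    if total <= 0:
        raise ValueError("the unnormalized data point must have a positive sum")
    if A <= D:
        raise ValueError("the normalization constant A must exceed D")
    return [(A - D) * value / total + 1.0 for value in y_raw]


# Toy example (hypothetical values): D = 4 inputs, normalization constant A = 10.
y = normalize([0.0, 3.0, 1.0, 0.0], A=10.0)
print(y)       # every entry is at least 1 (the background value)
print(sum(y))  # the entries sum to A
```

The background value of +1 keeps all inputs strictly positive, while the factor (A - D) makes the entries sum to A.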


To generate an observation $y$ from the model, we first draw a super class $k$ from a uniform categorical
distribution $p(k)$. Next we draw a subclass $c$ according to the conditional categorical distribution $p(c|k,\mathcal{R})$.
Given the subclass, we then sample $y$ from a Poisson distribution and assign to it the label $l$ corresponding
to class $k$. Eqs. (1) to (3) define a deep mixture model.
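The generative process described above can be sketched as ancestral sampling. This is a minimal illustration in plain Python, not the authors' implementation: the helper names, the toy weights and the use of Knuth's method for Poisson sampling are assumptions made for the example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_point(W, R, rng):
    """Draw (y, l, c) from eqs. (1) to (3); W is C x D, R is K x C."""
    k = rng.randrange(len(R))                          # p(k) = 1/K
    c = rng.choices(range(len(W)), weights=R[k])[0]    # p(c|k) = R_kc
    y = [sample_poisson(w_cd, rng) for w_cd in W[c]]   # p(y|c) = prod_d Poisson(y_d; W_cd)
    return y, k, c                                     # label l = k (no label noise)


# Hypothetical toy weights: K = 2 classes, C = 4 subclasses, D = 3, rows of W sum to A = 12.
W = [[8.0, 2.0, 2.0], [2.0, 8.0, 2.0], [2.0, 2.0, 8.0], [4.0, 4.0, 4.0]]
R = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
y, label, subclass = sample_point(W, R, random.Random(0))
print(y, label, subclass)
```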


2.2 Maximum Likelihood Learning 


To infer the model parameters $\Theta = (\mathcal{W}, \mathcal{R})$ of the deep Poisson mixture model eqs. (1) to (3) for a given
set of $N$ independent observed data points $\{y^{(n)}\}_{n=1,\ldots,N}$ with $\sum_d y_d^{(n)} = A$ and labels $\{l^{(n)}\}_{n=1,\ldots,N}$,
we seek to maximize the data (log-)likelihood

$$\mathcal{L}(\Theta) = \log \prod_{n=1}^{N} p(y^{(n)}, l^{(n)}|\Theta) = \sum_{n=1}^{N} \log \left( \sum_{c=1}^{C} \prod_{d=1}^{D} \frac{\mathcal{W}_{cd}^{\,y_d^{(n)}} e^{-\mathcal{W}_{cd}}}{\Gamma(y_d^{(n)} + 1)} \sum_{k} \frac{\mathcal{R}_{kc}}{K} \right). \tag{5}$$

Here, we assume that some or all of the data come with a label. For unlabeled data, the summation over $k$
is a summation over all possible labels of the given data, that is, $k = 1, \ldots, K$. Whenever the label
is known for a data point, this sum is reduced to $k = l^{(n)}$, such that only weights $\mathcal{R}_{l^{(n)}c}$ contribute
for that $n$th data point.

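Instead of maximizing the likelihood directly, EM [in the form studied by Neal and Hinton, 1998]
maximizes a lower bound, the free energy, given by:

$$\mathcal{F}(\Theta^{\mathrm{old}}, \Theta) = \sum_{n=1}^{N} \left\langle \log p(y^{(n)}, l^{(n)}, c, k|\Theta) \right\rangle_n + \mathcal{H}[\Theta^{\mathrm{old}}], \tag{6}$$

where $\langle \cdot \rangle_n$ denotes the expectation under the posterior

$$\langle f(c,k) \rangle_n = \sum_{c=1}^{C} \sum_{k=1}^{K} p(c, k|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}})\, f(c,k) \tag{7}$$

and $\mathcal{H}[\Theta^{\mathrm{old}}]$ is an entropy term only depending on parameter values held fixed during the optimization of
$\mathcal{F}$ w.r.t. $\Theta$. For our model, the free energy as a lower bound of the log-likelihood reads

$$\mathcal{F}(\Theta^{\mathrm{old}}, \Theta) = \sum_{n,c,k} p(c, k|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}}) \left( \sum_{d=1}^{D} \left( y_d^{(n)} \log(\mathcal{W}_{cd}) - \mathcal{W}_{cd} - \log \Gamma(y_d^{(n)} + 1) \right) + \log(\mathcal{R}_{kc}) - \log(K) \right) + \mathcal{H}[\Theta^{\mathrm{old}}]. \tag{8}$$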

The EM algorithm optimizes the free energy by iterating two steps: First, given the current parameters
$\Theta^{\mathrm{old}}$, the relevant expectation values under the posterior are computed in the E-step. Given these posterior
expectations, $\mathcal{F}(\Theta^{\mathrm{old}}, \Theta)$ is then maximized w.r.t. $\Theta$ in the M-step. Iteratively applying E- and M-steps
locally maximizes the data likelihood.


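M-step. The parameter update equations of the model can canonically be derived by maximizing the
free energy eq. (8) under the given boundary conditions of eqs. (2) and (3). By using Lagrange multipliers
for constrained optimization, we obtain after straightforward derivations:

$$\mathcal{W}_{cd} = A\, \frac{\sum_n p(c|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}})\, y_d^{(n)}}{\sum_{d'} \sum_n p(c|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}})\, y_{d'}^{(n)}} \tag{9}$$

$$\mathcal{R}_{kc} = \frac{\sum_n p(k|c, l^{(n)}, \Theta^{\mathrm{old}})\, p(c|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}})}{\sum_{c'} \sum_n p(k|c', l^{(n)}, \Theta^{\mathrm{old}})\, p(c'|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}})}. \tag{10}$$

For details please refer to appendix A.1.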


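E-step. For the hierarchical mixture model, the required posteriors over the unobserved latents in eqs. (9)
and (10) can be efficiently computed in closed form in the E-step. Due to an interplay of the used Poisson
distribution and the constraint for $\mathcal{W}$ of eq. (3), the equations greatly simplify, and can be shown to follow
a softmax function with weighted sums over inputs $y_d^{(n)}$ and $u_k^{(n)}$ as arguments (see appendix A.1):

$$p(c|y^{(n)}, l^{(n)}, \Theta^{\mathrm{old}}) = \frac{\exp(I_c^{(n)})}{\sum_{c'} \exp(I_{c'}^{(n)})}, \quad \text{with} \tag{11}$$

$$I_c^{(n)} = \sum_d \log(\mathcal{W}_{cd}^{\mathrm{old}})\, y_d^{(n)} + \log\Big( \sum_k u_k^{(n)} \mathcal{R}_{kc}^{\mathrm{old}} \Big), \tag{12}$$

$$u_k^{(n)} = \begin{cases} \delta_{k l^{(n)}} & \text{for labeled data} \\ \frac{1}{K} & \text{for unlabeled data.} \end{cases} \tag{13}$$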
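Also note that the posteriors $p(c|y, l, \Theta)$ for labeled data and $p(c|y, \Theta)$ for unlabeled data only differ in
the chosen distribution for $u_k^{(n)}$.

For the E-step posterior over classes $k$, we obtain:

$$p(k|c, l^{(n)}, \Theta^{\mathrm{old}}) = \begin{cases} \delta_{k l^{(n)}} & \text{for labeled data} \\ \dfrac{\mathcal{R}_{kc}^{\mathrm{old}}}{\sum_{k'} \mathcal{R}_{k'c}^{\mathrm{old}}} & \text{for unlabeled data.} \end{cases} \tag{14}$$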


The expression for unlabeled data makes use of the assumption of a uniform prior in eq. (1). Under the
assumption of a non-uniform class distribution, the weights $\mathcal{R}_{kc}$ would be weighted by the priors $p(k)$,
which here simply cancel out.
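To make the interplay of eqs. (9) to (14) concrete, the following is a minimal sketch of one full EM iteration in plain Python. It is an illustration under stated assumptions, not the authors' implementation: all function names and the toy data are hypothetical, `label=None` marks an unlabeled point, and a small floor inside the logarithm is added only to avoid `log(0)` for subclasses that are incompatible with a given label.

```python
import math


def posterior_c(y, label, W, R):
    """Eqs. (11) to (13): softmax posterior over subclasses c."""
    K = len(R)
    u = [1.0 / K] * K if label is None else [float(k == label) for k in range(K)]
    inputs = []
    for c, w_row in enumerate(W):
        bottom_up = sum(math.log(w) * y_d for w, y_d in zip(w_row, y))
        top_down = sum(u[k] * R[k][c] for k in range(K))
        inputs.append(bottom_up + math.log(max(top_down, 1e-300)))
    shift = max(inputs)                      # subtract the maximum for numerical stability
    exps = [math.exp(i - shift) for i in inputs]
    norm = sum(exps)
    return [e / norm for e in exps]


def posterior_k_given_c(c, label, R):
    """Eq. (14): delta on the label if it is known, else the normalized column of R."""
    K = len(R)
    if label is not None:
        return [float(k == label) for k in range(K)]
    column_sum = sum(R[k][c] for k in range(K))
    return [R[k][c] / column_sum for k in range(K)]


def em_iteration(data, labels, W, R, A):
    """E-step followed by the M-step updates of eqs. (9) and (10)."""
    C, D, K = len(W), len(W[0]), len(R)
    w_acc = [[0.0] * D for _ in range(C)]
    r_acc = [[0.0] * C for _ in range(K)]
    for y, label in zip(data, labels):
        p_c = posterior_c(y, label, W, R)
        for c in range(C):
            p_k = posterior_k_given_c(c, label, R)
            for d in range(D):
                w_acc[c][d] += p_c[c] * y[d]
            for k in range(K):
                r_acc[k][c] += p_k[k] * p_c[c]
    new_W = [[A * v / sum(row) for v in row] for row in w_acc]   # rows sum to A
    new_R = [[v / sum(row) for v in row] for row in r_acc]       # rows sum to 1
    return new_W, new_R


# Hypothetical toy problem: D = 3, C = 2, K = 2, data already normalized to A = 10.
A = 10.0
data = [[6.0, 3.0, 1.0], [5.0, 4.0, 1.0], [1.0, 3.0, 6.0], [1.0, 4.0, 5.0]]
labels = [0, None, 1, None]
W = [[4.0, 3.0, 3.0], [3.0, 3.0, 4.0]]
R = [[0.6, 0.4], [0.4, 0.6]]
for _ in range(20):
    W, R = em_iteration(data, labels, W, R, A)
print(W)
print(R)
```

Labeled and unlabeled points pass through the same update; they differ only in the vector `u` and in the posterior over classes, which mirrors the statement that a single likelihood objective incorporates both kinds of data.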


Probabilistically Optimal Classification. Once we have obtained a set of values for model parameters
$\Theta$ by applying the EM algorithm on training data, we can use the optimized generative model to infer
the posterior distribution $p(k|y, \Theta)$ given a previously unseen observation $y$. For our model this posterior
is given by

$$p(k|y, \Theta) = \sum_c \frac{\mathcal{R}_{kc}}{\sum_{k'} \mathcal{R}_{k'c}}\, p(c|y, \Theta). \tag{15}$$

While this expression provides a full posterior distribution, the maximum a-posteriori (MAP) value can be
used for deterministic classification.
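A minimal sketch of eq. (15) for an unlabeled test point, written in plain Python, is given below. The function name `class_posterior` and the toy weights are assumptions made for illustration; the subclass posterior uses eqs. (11) to (13) with $u_k = 1/K$.

```python
import math


def class_posterior(y, W, R):
    """Posterior p(k|y) of eq. (15) for an unlabeled, normalized data point y."""
    K, C = len(R), len(W)
    column_sums = [sum(R[k][c] for k in range(K)) for c in range(C)]
    # Eqs. (11) to (13) with u_k = 1/K for unlabeled data.
    inputs = [
        sum(math.log(w) * y_d for w, y_d in zip(W[c], y)) + math.log(column_sums[c] / K)
        for c in range(C)
    ]
    shift = max(inputs)
    exps = [math.exp(i - shift) for i in inputs]
    norm = sum(exps)
    p_c = [e / norm for e in exps]
    # Eq. (15): mix the subclass posteriors with the normalized columns of R.
    return [sum(R[k][c] / column_sums[c] * p_c[c] for c in range(C)) for k in range(K)]


# Hypothetical trained weights: one subclass per class, rows of W sum to A = 10.
W = [[6.0, 3.0, 1.0], [1.0, 3.0, 6.0]]
R = [[0.9, 0.1], [0.1, 0.9]]
p_k = class_posterior([6.0, 3.0, 1.0], W, R)
map_class = max(range(len(p_k)), key=lambda k: p_k[k])   # MAP estimate
print(p_k, map_class)
```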


3 A Neural Network for Optimal Hierarchical Learning 

For the purposes of this study, we now turn to the task of finding a neural network formulation that
corresponds to learning and inference in the hierarchical generative model of sec. 2. The study of optimal
learning and inference with neural networks is a popular research field, and we here follow an approach
similar to Lücke and Sahani [2008], Nessler et al. [2009], Keck et al. [2012] and Nessler et al. [2013].


3.1 A Neural Network Approximation 

Consider the neural network in fig. 2 with neural activities $y$, $s$ and $t$. We refer to neurons $y_1, \ldots, y_D$ as the
observed layer, the neurons $s_1, \ldots, s_C$ make up the first hidden layer, and the neurons $t_1, \ldots, t_K$ form the second
hidden layer. We assume the values of $y$ to be obtained from a set of unnormalized data points $\tilde{y}$ by eq. (4),
and the label information to be presented as top-down input vector $u$ as given in eq. (13).

Furthermore, we assume the neural activities $s$ and $t$ to be normalized to $B$ and $B'$, respectively (such
that $\sum_d y_d = A$, $\sum_k u_k = 1$, $\sum_c s_c = B$, and $\sum_k t_k = B'$, with $A > D$ and $B, B' > 0$). For the neural
weights $(W, R)$ of the network, which we distinguish for now from the generative weights $(\mathcal{W}, \mathcal{R})$ of the
mixture model, we consider Hebbian learning with a subtractive synaptic scaling term [see, for
example, Abbott and Nelson, 2000]:

$$\Delta W_{cd} = \epsilon_W (s_c y_d - s_c W_{cd}) \tag{16}$$

$$\Delta R_{kc} = \epsilon_R (t_k s_c - t_k R_{kc}), \tag{17}$$

where $\epsilon_W > 0$ and $\epsilon_R > 0$ are learning rates. These learning rules are local, can integrate both supervised
and unsupervised learning, are highly parallelizable, and result in normalized weights
that we can relate to our generative model as follows: By taking sums over $d$ and $c$, respectively,
we observe that the learning dynamics cause $\sum_d W_{cd}$ to converge to $A$ and $\sum_c R_{kc}$ to converge
to $B$ (due to activities $y$ and $s$ being normalized accordingly). If we therefore now assume the weights
$W$ and $R$ to be normalized to $A$ and $B$, respectively, we can compute how a given weight adapts with
cumulative learning steps. For small learning rates, we can approximate the weight updates by
$\Delta W_{cd} = \epsilon_W s_c y_d$ and $\Delta R_{kc} = \epsilon_R t_k s_c$, followed by explicit normalization to $A$ and $B$, respectively.
Using the superscript $(n)$ to denote the parameter states and activities of the network at the $n$th
learning step, we can write the effect of such subsequent weight updates as

$$W_{cd}^{(n+1)} = A\, \frac{W_{cd}^{(n)} + \epsilon_W s_c^{(n)} y_d^{(n)}}{\sum_{d'} \big( W_{cd'}^{(n)} + \epsilon_W s_c^{(n)} y_{d'}^{(n)} \big)} \quad \text{and} \quad R_{kc}^{(n+1)} = B\, \frac{R_{kc}^{(n)} + \epsilon_R t_k^{(n)} s_c^{(n)}}{\sum_{c'} \big( R_{kc'}^{(n)} + \epsilon_R t_k^{(n)} s_{c'}^{(n)} \big)}, \tag{18}$$

where $s_c^{(n)} = s_c(y^{(n)}, u^{(n)}, W^{(n)}, R^{(n)})$ denotes the activation of neurons $s_c$ at the $n$th iteration, which
depends on inputs $y^{(n)}$, $u^{(n)}$ and the weights $W^{(n)}$, $R^{(n)}$. Similarly, $t_k^{(n)} = t_k(s^{(n)}, u^{(n)}, R^{(n)})$ depends
on $s^{(n)}$, $u^{(n)}$ and $R^{(n)}$. By iteratively applying eqs. (18) $N$ times, we can obtain formulas for the
weights $W^{(N)}$ and $R^{(N)}$, that is, the weights after having learned from $N$ data points. If learning converges and
$N$ is large enough, these can be regarded as the converged weights. It turns out that the emerging large
nested sums can, at the point of convergence, be compactly rewritten through the use of Taylor expansions
and the geometric series. Appendix A.2 gives details on the necessary analytical steps. As a result, we
obtain that the following equations must be satisfied for $W$ and $R$ at convergence:

$$W_{cd} = A\, \frac{\sum_n s_c^{(n)} y_d^{(n)}}{\sum_{d'} \sum_n s_c^{(n)} y_{d'}^{(n)}} \quad \text{and} \quad R_{kc} = B\, \frac{\sum_n t_k^{(n)} s_c^{(n)}}{\sum_{c'} \sum_n t_k^{(n)} s_{c'}^{(n)}}. \tag{19}$$

Figure 2: Graphical illustration of the hierarchical recurrent neural network.


Eqs. (19) become exact fixed points for learning in eqs. (16) and (17) in the limit of small learning rates
$\epsilon_W$ and $\epsilon_R$ and large numbers of data points $N$. Given the normalization constraints demanded above,
eqs. (19) apply for any neural activation rules for $s_c$ and $t_k$, as long as learning follows eqs. (16) and (17)
and as long as learning converges.
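The normalization property of the learning rule in eq. (16), on which the derivation of eqs. (19) relies, can be checked numerically. The following minimal sketch in plain Python applies eq. (16) to a single hidden unit with arbitrary activities; the random inputs, the learning rate and the helper name `hebbian_step` are assumptions chosen only for this illustration.

```python
import random


def hebbian_step(w_row, s_c, y, eps_w):
    """Eq. (16) for one hidden unit c: W_cd += eps_W * (s_c * y_d - s_c * W_cd)."""
    return [w + eps_w * (s_c * y_d - s_c * w) for w, y_d in zip(w_row, y)]


rng = random.Random(1)
A, D, eps_w = 20.0, 5, 0.05
w_row = [rng.uniform(0.5, 1.5) for _ in range(D)]   # initial weight sum is far from A
for _ in range(2000):
    raw = [rng.random() for _ in range(D)]
    scale = A / sum(raw)
    y = [value * scale for value in raw]            # inputs are normalized to sum to A
    s_c = rng.random()                              # any activity in [0, 1) works here
    w_row = hebbian_step(w_row, s_c, y, eps_w)
print(sum(w_row))                                   # approaches A
```

Summing eq. (16) over $d$ gives $\Delta \sum_d W_{cd} = \epsilon_W s_c (A - \sum_d W_{cd})$, so the distance of the weight sum to $A$ shrinks by a factor $(1 - \epsilon_W s_c)$ in every step, independently of the activation rule.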

For our purpose, we identify $s_c$ with the posterior probability $p(c|y, l, \Theta)$ for labeled data and $p(c|y, \Theta)$
for unlabeled data given by eqs. (11) to (13) with $\Theta = (W, R)$:

$$s_c^{(n)} := p(c|y^{(n)}, l^{(n)}, \Theta^{(n)}) = \frac{\exp(I_c^{(n)})}{\sum_{c'} \exp(I_{c'}^{(n)})}, \quad \text{with} \tag{20}$$

$$I_c^{(n)} = \sum_d \log(W_{cd}^{(n)})\, y_d^{(n)} + \log\Big( \sum_k u_k^{(n)} R_{kc}^{(n)} \Big), \tag{21}$$

and $u_k^{(n)}$ as given by eq. (13), which incorporates the label information.

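Furthermore, we identify $t_k$ with the posterior distribution over classes $k$, which for labeled data is $p(k|l)$
given in eq. (14) and for unlabeled data $p(k|y, \Theta)$ as given by eq. (15):

$$t_k^{(n)} := \begin{cases} p(k|l^{(n)}) = u_k^{(n)} & \text{for labeled data} \\ p(k|y^{(n)}, \Theta^{(n)}) = \sum_c \dfrac{R_{kc}^{(n)}}{\sum_{k'} R_{k'c}^{(n)}}\, s_c^{(n)} & \text{for unlabeled data.} \end{cases} \tag{22}$$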


The complete set of activation and learning rules, after identifying neural activities $s_c$ and $t_k$ with the
respective posterior distributions, is summarized in tab. 1:

Neural Simpletron

Input
  Bottom-Up:    $\tilde{y}_d$ (unnormalized data)    (T1.1)
  Top-Down:     $u_k = \delta_{kl}$ for labeled data, $u_k = 1/K$ for unlabeled data    (T1.2)

Activation Across Layers
  Obs. Layer:   $y_d = (A - D)\, \tilde{y}_d / \sum_{d'} \tilde{y}_{d'} + 1$    (T1.3)
  1st Hidden:   $s_c = \exp(I_c) / \sum_{c'} \exp(I_{c'})$, with    (T1.4)
                $I_c = \sum_d \log(W_{cd})\, y_d + \log(\sum_k u_k R_{kc})$    (T1.5)
  2nd Hidden:   $t_k = u_k$ for labeled data, $t_k = \sum_c (R_{kc} / \sum_{k'} R_{k'c})\, s_c$ for unlabeled data    (T1.6)

Learning of Neural Weights
  1st Hidden:   $\Delta W_{cd} = \epsilon_W (s_c y_d - s_c W_{cd})$    (T1.7)
  2nd Hidden:   $\Delta R_{kc} = \epsilon_R (t_k s_c - t_k R_{kc})$ for labeled data    (T1.8)

Table 1: Neural network formulation of probabilistic inference and maximum likelihood learning.


By comparing eqs. (19) with the M-step eqs. (9) and (10), we can now observe that such neural learning
converges to the same fixed points as EM for the hierarchical Poisson mixture model (note that we set
$B = B' = 1$, as $s_c$ and $t_k$ sum to one). While the identification of $W_{cd}$ with $\mathcal{W}_{cd}$ at convergence is
straightforward, we have to restrict learning of $\mathcal{R}_{kc}$ to labeled data to gain a neural equivalent in $R_{kc}$. In
that case $p(k|c, l^{(n)}, \Theta^{\mathrm{old}}) = p(k|l^{(n)})$, which corresponds to our chosen activities $t_k$ for labeled inputs.
(In sec. 3.3, we will show a way to relax this restriction by using self-labeling on unlabeled data
with high inference certainty.)

In other words, by executing the online neural network of tab. 1, we optimize the likelihood of the
generative model eqs. (1) to (3). The neural activities therein provide the posterior probabilities, which
we can, for example, use for classification. The computation of posteriors is in general a difficult and
computationally intensive endeavor, and their interpretation as neural activation rules is usually difficult.
In our case, however, because of a specific interplay between the introduced constraints, the categorical
distribution and Poisson noise, the posteriors and their neural interpretation greatly simplify.

All equations in tab. 1 can directly be interpreted as neural activation or learning rules. Let us consider an
unnormalized data point $\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_D)$ as bottom-up input to the network. Labels are neurally coded
as top-down information $u = (u_1, \ldots, u_K)$, where only the entry $u_l$ equals one if $l$ is the label, and all
other units are zero^1. In the case of unlabeled data, all labels are assumed to be equally likely at $1/K$. As a
first processing step, a divisive normalization, Eq. (T1.3), is executed to obtain activations $y_d$. Considering
Eqs. (T1.4) and (T1.5), we can interpret $I_c$ as input to neural unit $s_c$. The input consists of a bottom-up
and a top-down activation. The bottom-up input is the standard weighted summation of neural networks,
$\sum_d \log(W_{cd})\, y_d$ (note that we could redefine the weights by $\tilde{W}_{cd} := \log W_{cd}$). Likewise, the top-down
input is a standard weighted sum, $\sum_k u_k R_{kc}$, but affects the input through a logarithm. Both sums can
be computed locally at the neural unit $c$. The inputs to the hidden units are then combined using a
softmax function, which is also standard for neural networks. However, in contrast to discriminative
networks, the weighted sums and the softmax function are here a direct result of the correspondence
to a generative mixture model [compare also Jordan and Jacobs, 1994]. The activation of the top layer,
Eq. (T1.6), is directly given by the top-down input if the data label is known. For unlabeled
data, the inference again takes the form of a weighted sum over bottom-up inputs, which are now the
activations from the middle layer. Regarding learning, both Eqs. (T1.7) and (T1.8) are local Hebbian
learning equations with synaptic scaling. The weights of the first hidden layer are updated on all data
points during learning, while those of the second hidden layer only learn from labeled input data.
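The rules of tab. 1 can be sketched as a single online update step. The following plain-Python version is a minimal illustration and not the Theano implementation used in the experiments: the function name `nesi_step`, the in-place list updates, the floor inside the logarithm and the toy values are assumptions made for the example, and `label=None` marks an unlabeled data point.

```python
import math


def nesi_step(y_raw, label, W, R, A, eps_w, eps_r):
    """One online r-NeSi step following tab. 1; W (C x D) and R (K x C) are updated in place."""
    C, D, K = len(W), len(W[0]), len(R)
    total = sum(y_raw)
    y = [(A - D) * v / total + 1.0 for v in y_raw]                            # T1.3
    u = [1.0 / K] * K if label is None else [float(k == label) for k in range(K)]  # T1.2
    inputs = [
        sum(math.log(W[c][d]) * y[d] for d in range(D))
        + math.log(max(sum(u[k] * R[k][c] for k in range(K)), 1e-300))
        for c in range(C)
    ]                                                                         # T1.5
    shift = max(inputs)
    exps = [math.exp(i - shift) for i in inputs]
    norm = sum(exps)
    s = [e / norm for e in exps]                                              # T1.4
    if label is None:                                                         # T1.6
        col = [sum(R[k][c] for k in range(K)) for c in range(C)]
        t = [sum(R[k][c] / col[c] * s[c] for c in range(C)) for k in range(K)]
    else:
        t = u
    for c in range(C):                                                        # T1.7
        for d in range(D):
            W[c][d] += eps_w * (s[c] * y[d] - s[c] * W[c][d])
    if label is not None:                                                     # T1.8
        for k in range(K):
            for c in range(C):
                R[k][c] += eps_r * (t[k] * s[c] - t[k] * R[k][c])
    return s, t


# Hypothetical toy network: D = 3, C = 2, K = 2, A = 12.
A = 12.0
W = [[6.0, 3.0, 3.0], [3.0, 3.0, 6.0]]
R = [[0.5, 0.5], [0.5, 0.5]]
stream = [([9.0, 1.0, 0.0], 0), ([8.0, 2.0, 1.0], None), ([0.0, 1.0, 9.0], 1)]
for y_raw, label in stream:
    s, t = nesi_step(y_raw, label, W, R, A, eps_w=0.1, eps_r=0.1)
print(W)
print(R)
```

If the weights start out normalized, each step preserves the row sums of `W` (equal to `A`) and of `R` (equal to 1), because the update is a convex combination of the old weights and the normalized activities.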

As a control of our analytical derivation of tab. 1, we verified numerically that the local optima of the
neural network are indeed also local optima of the EM algorithm. Note in this respect that, although
neural learning has the same convergence points as EM learning for the mixture model, at finite distances
from the convergence points, neural learning follows different gradients, such that the trajectories of the
network in parameter space are different from EM. By adjusting the learning rates in Eqs. (T1.7) and
(T1.8), the gradient directions can be changed in a systematic way without changing the convergence
points, which we observed to be beneficial to avoid convergence to shallow local optima.

The equations defining the neural network are elementary, very compact, and contain a total number of
only four free parameters: the number of hidden units $C$, an input normalization constant $A$, and learning
rates $\epsilon_W$ and $\epsilon_R$. Because of its compactness we call the network Neural Simpletron (NeSi).

In the experiments in sec. 4, we differentiate between four neural network approximations on the basis of 
tab. 1. These result from two different approximations of the activations in the first hidden layer, and two 
different approximations for the activations in the second hidden layer, which gives a total of two by two 
different networks to investigate. These approximations in the first and second hidden layer are discussed 
in the following two subsections respectively. 


3.2 Recurrent, Feedforward and Greedy Learning 

The complete formulas for the first hidden layer, given in Eqs. (T1.4) and (T1.5), define a recurrent
network, that is, a network that combines both bottom-up and top-down information: The first summation
in $I_c$ incorporates the bottom-up information. Due to the chosen normalization in Eq. (T1.3) with a
background value of +1, all summands in this term are non-negative. Values of the sum over these
bottom-up connections will be high for input data $y$ that was generated by the hidden unit $c$. The second
summation in $I_c$ incorporates top-down information. The weighted sum inside the logarithm, which can
take the label information into account, will always yield values between zero and one. Thus, because
of the logarithm, this second term is always non-positive and suppresses the activation of the unit. This
suppression is stronger the less likely it is that the given hidden unit $c$ belongs to the class of the provided
label $l$ (for labeled data), and the less likely it is that this unit becomes active at all. Because of these
recurrent connections between the first and second hidden layer, we will refer to the network of tab. 1 as
r-NeSi (‘r’ for recurrent) in the experiments. By ‘recurrent’ we do not mean a temporal
memory of sequential inputs, but only the direction in which information flows through the network
[following, for example, Dayan and Abbott, 2001].

^1 This is sometimes referred to as ‘one-hot’ coding.

To investigate the influence of such recurrent information in the network, we also test a pure feedforward 
version of the first hidden layer. There, we remove all top-down connections by simply discarding the 
second term in Eq. (T1.5). Such a feedforward formulation of the network is equivalent to treating the
distribution $p(c|k, R)$ in the first hidden layer as a uniform prior distribution $p(c) = 1/C$. We will refer
to this feedforward network as ‘ff-NeSi’ in the experiments. Since ff-NeSi is stripped of all top-down 
recurrence and the fixed points of the second hidden layer now only depend on the activities of the first 
hidden layer at convergence, it can also be trained disjointly using a greedy layer-by-layer approach, 
which is customary for deep networks. 
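The stated equivalence between dropping the top-down term and assuming a uniform prior $p(c) = 1/C$ follows because a constant added to all inputs cancels in the softmax. A minimal numerical check in plain Python, with made-up bottom-up inputs, could look as follows.

```python
import math


def softmax(values):
    shift = max(values)                       # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical bottom-up inputs sum_d log(W_cd) * y_d for C = 3 hidden units.
bottom_up = [19.5, 13.9, 17.2]
C = len(bottom_up)

s_feedforward = softmax(bottom_up)                                   # second term discarded
s_uniform = softmax([b + math.log(1.0 / C) for b in bottom_up])      # p(c) = 1/C
print(s_feedforward)
print(s_uniform)
```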

3.3 Self-Labeling 

So far, we trained the top layer of NeSi completely supervised by updating the weights in Eq. (T1.8) only
on labeled data. When labeled data is sparse, it could be beneficial to also make use of unlabeled data
in this layer. We can do so by letting the network itself provide the missing labels [a procedure often
termed ‘self-labeling’, see, for example, Lee, 2013, Triguero et al., 2015]. The availability of the full
posterior distribution in the network (Eq. T1.6 for unlabeled data) allows us to selectively use only
those inferred labels where the network shows a very high classification certainty. As an index for decision
certainty we use the ‘Best versus Second Best’ (BvSB) measure on $t_k$, which is simply the absolute
difference between the most likely and the second most likely prediction. Such a measure gives a sensible
indicator for high skewness of the distribution towards a single class [Joshi et al., 2009]. If the BvSB lies
above some threshold parameter $\vartheta$, which we treat as an additional free parameter, we can approximate the
full posterior in $t_k$ by the MAP estimate. In that case, we replace $t_k$ by $\mathrm{MAP}(t_k)$, such that $t_k$ for unlabeled
data now holds the ‘one-hot’ coded inferred label information, with which we update the top layer in the
usual fashion using Eq. (T1.8).
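The BvSB criterion described above can be sketched in a few lines of plain Python. The function name `self_label`, its return convention (`None` when the certainty is too low) and the example values are assumptions made for this illustration.

```python
def self_label(t, threshold):
    """Return the one-hot MAP estimate of t if its BvSB exceeds the threshold, else None."""
    ranked = sorted(range(len(t)), key=lambda k: t[k], reverse=True)
    best, second = ranked[0], ranked[1]
    bvsb = t[best] - t[second]                 # Best versus Second Best
    if bvsb > threshold:
        return [1.0 if k == best else 0.0 for k in range(len(t))]
    return None


confident = self_label([0.05, 0.90, 0.05], threshold=0.5)   # BvSB = 0.85
uncertain = self_label([0.40, 0.45, 0.15], threshold=0.5)   # BvSB = 0.05
print(confident, uncertain)
```

In a training loop, a non-`None` result would take the place of $t_k$ in Eq. (T1.8) for that unlabeled data point, while `None` means the top layer is not updated.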

This specific manner of using inferred labels in the neural network is again not imposed ad hoc, but can
be derived from the underlying generative model by considering the M-step eq. (10) for unlabeled data.
When in the generative model the posterior $p(k|y, \Theta) = \sum_c p(k|c, \Theta)\, p(c|y, \Theta)$ comes close to a hard
max, it must be that $p(c|y, \Theta)$ takes dominantly high values only for those units $c$ that belong to the same
class. For these units, we can then replace $p(k|c, \Theta)$ by the MAP estimate in close approximation. We
can therefore rewrite the products in eq. (10) for unlabeled data as

$$p(k|c, \Theta)\, p(c|y^{(n)}, \Theta) \approx \delta_{k\hat{l}^{(n)}}\, p(c|y^{(n)}, \Theta) \quad \forall n \in \mathcal{N}: \; p(k|y^{(n)}, \Theta) \approx \delta_{k\hat{l}^{(n)}}, \tag{23}$$

with the inferred label $\hat{l}^{(n)}$. Here, for all data points $n \in \mathcal{N}$ with high classification certainty, $p(c|y^{(n)}, \Theta)$
acts as a filter, such that only those terms contribute where $p(k|c, \Theta)$ is close to a hard max. With this
approximation, we can replace the dependency of the first factor in eq. (23) on specific units $c$ by a
common dependency on all units that are connected to unit $k$ (as the inferred label $\hat{l}$ depends on all those
units). We are then able to translate these results again into neural learning rules, where the top layer
activation is only dependent on the combined input to that unit, as done above.

We mark those NeSi networks where we use self-labeling in the top layer with a superscript ‘+’ (that is, ‘r$^+$-NeSi’ and
‘ff$^+$-NeSi’). Although we here use the MAP estimate of $t_k$ during training, because of the validity of
eq. (23) at high inference certainty, we are still learning in the context of the generative model eqs. (1)
to (3). Thus, we still keep the full posterior distribution in $t_k$ for inference, as well as all identifications of
sec. 3.1.


4 Numerical Experiments 


We apply an efficiently scalable implementation^2 of our network to three standard benchmarks for
classification: the 20 Newsgroups text data set [Lang, 1995], the MNIST data set of handwritten digits 
[LeCun et al., 1998] and the NIST Special Database 19 of handwritten characters [Grother, 1995]. To 
investigate the semi-supervised task, we randomly divide the training parts of the data sets into labeled 
and unlabeled partitions, where we make sure that each class holds the same number of labeled training 
examples, if possible. We repeat experiments for different proportions of labeled data and measure the 
classification error on the blind test set. For all such settings, we report the average test error over a given 
number of independent training runs with new random labeled and unlabeled data selection. Details on 
parallelization and weight initialization can be found in appendix B. Detailed statistics of the obtained 
results are given in appendix C. 


4.1 Parameter Tuning 

For the NeSi algorithms, we have four free parameters: the normalization constant $A$ in the bottom layer,
the number of hidden units $C$ and the learning rate $\epsilon_W$ in the middle layer, and the learning rate $\epsilon_R$ in the
top layer. When using the optional self-labeling, we have a fifth free parameter, the BvSB threshold $\vartheta$, also
in the top layer.

To optimize the free parameters in the semi-supervised setting with only few labeled data points, it is 
customary to use a validation set, which comprises additional labeled data to the available amount of 
labels in the training set of that given setting (for example, using a validation set of 1000 labeled data 
points to tune parameters in the setting of 100 labels). As this procedure does not guarantee that the 
resulting optimal parameter setting could have also been found with the limited amount of labels in the 
given setting, such achieved results reflect more of the performance limit of the model than the actual 
performance when given only very restricted amounts of labeled data. As in Forster et al. [2015],
we therefore not only train our model on such limited labeled data, but also tune all free parameters in this
same setting without any additional labeled data. This way we make sure that our results are achievable
by using no more labels than provided within each training setting. Furthermore, using only training data
for parameter optimization assures a fully blind test set, such that the test error gives a reliable index for 
generalization. 

^2 We use a Python 2.7 implementation of the NeSi algorithms, which is optimized using Theano to execute on NVIDIA
GeForce GTX TITAN Black and TITAN X GPUs. Details can be found in appendix B.1. We have provided the source code and
scripts for repeating the experiments discussed here along with the submission.

To construct the training and validation set for parameter tuning, we regard the setting of 10 labeled
training data points per class (that is, 100 labeled data points for MNIST and 200 for 20 Newsgroups).
This is the setting with the lowest number of labels on which models are generally compared on MNIST.
For simplicity’s sake, we take half of this labeled data as validation set (class balanced and randomly
drawn) and use the other labeled half plus all unlabeled training data as training set for parameter tuning.
With this data split, we optimize the parameters of the r-NeSi network via a coarse manual grid search.
For the search space, we may consider run time vs. performance trade-offs where necessary (for example,
with an upper bound on the network size or a lower bound on the learning rates). Keeping the optimized
parameter setting of r-NeSi fixed, we only optimize $\vartheta$ for r$^+$-NeSi. For comparison, we keep the same
parameter settings for the feedforward networks (ff-NeSi and ff$^+$-NeSi) without further optimization.
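The class-balanced split used for tuning can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the function name `balanced_split` and the synthetic label list are hypothetical, and all indices not returned are meant to be treated as unlabeled training data.

```python
import random


def balanced_split(labels, per_class, rng):
    """Draw 2 * per_class indices per class; half keep their label for training, half form the validation set."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_labeled, validation = [], []
    for label in sorted(by_class):
        chosen = rng.sample(by_class[label], 2 * per_class)
        train_labeled += chosen[:per_class]
        validation += chosen[per_class:]
    return train_labeled, validation


# Synthetic stand-in for a training set with 3 classes and 20 points per class.
labels = [i % 3 for i in range(60)]
train_labeled, validation = balanced_split(labels, per_class=5, rng=random.Random(0))
print(sorted(train_labeled))
print(sorted(validation))
```

With `per_class=5`, this corresponds to the described setting of 10 labels per class, of which 5 are used as training labels and 5 for validation.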

Once optimized in this semi-supervised setting, we keep the free parameters fixed for all following
experiments. When evaluating the performance of the networks, we perform repeated experiments with
different sets of randomly chosen training labels. This evaluation scheme is of course only possible with
more labels available than used by each single network. However, this procedure serves purely to gather
meaningful statistics about the mean and variance of the acquired results, as these can vary with
the set of randomly chosen labels. As the experiments are performed independently of each other and
the parameters are not further tuned based on these results on the test set, the
acquired results are a statistical representation of the performance of our models given no more than the
corresponding number of labels in each setting.

A more rigorous parameter tuning would also allow for retuning of all parameters for each model and
each new label setting, making use of the additional training label information in the more strongly labeled
settings, which we however refrained from doing for our purposes. The overall tuning, training and testing
protocol is shown in fig. 3.


[Figure 3 diagram. Panels: Tuning (full training set), Training (full training set), Testing (blind test set).]




Figure 3: Tuning, training and testing protocol for the NeSi algorithms. During tuning, the free parameters are optimized on a 
split of the training data into a training and validation set with 5 randomly chosen labeled data points per class in each, and 
all remaining unlabeled data points in the training set. These data sets with their chosen labels remain fixed during all tuning 
iterations. With the resulting set of optimized free parameters, the network is then trained on all available training data and labels 
in the given setting and is evaluated on the fully blind test set. This last training and testing step is repeated with a new, randomly 
chosen, class balanced set of training labels for multiple independent iterations to gain the mean generalization error of the 
algorithms. 


4.2 Document Classification (20 Newsgroups) 

The 20 Newsgroups data set in the ‘bydate’ version consists of 18 774 newsgroup documents of which 
11 269 form the training set and the remaining 7505 form the test set. Each data vector comprises the 
raw occurring frequencies of 61188 words in each document. We preprocess the data using only tf-idf 
weighting [Sparck Jones, 1972]. No stemming, removals of stop words or frequency cutoffs were applied. 
The documents belong to 20 different classes of newsgroup topics that are partitioned into six different 
subject matters (‘comp’, ‘rec’, ‘sci’, ‘forsale’, ‘politics’ and ‘religion’). We show experiments for both 
classification into subject matter (6 classes) as well as the more difficult full 20-class problem. 
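
The tf-idf preprocessing mentioned above can be sketched as follows. The exact tf-idf variant is not stated in the text, so this sketch assumes the common form in which the raw count is multiplied by log(N / df); the function name `tfidf` and the toy counts are hypothetical.

```python
import math


def tfidf(counts):
    """Weight raw word counts by inverse document frequency.

    counts: list of documents, each a list of raw word counts (length D).
    Assumed variant: tf * log(N / df), with idf = 0 for unseen words.
    """
    n_docs = len(counts)
    n_words = len(counts[0])
    # document frequency: number of documents that contain each word
    df = [sum(1 for doc in counts if doc[d] > 0) for d in range(n_words)]
    idf = [math.log(n_docs / df[d]) if df[d] > 0 else 0.0
           for d in range(n_words)]
    return [[doc[d] * idf[d] for d in range(n_words)] for doc in counts]


# Toy corpus (hypothetical): 3 documents over a 3-word vocabulary.
docs = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
weighted = tfidf(docs)
```

Under this assumed variant, a word that occurs in every document receives weight zero, and the resulting values stay non-negative.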


4.2.1 Parameter Tuning on 20 Newsgroups 

In the following, we give a short overview of the parameter tuning on the 20 Newsgroups data set. We
use the procedure described in sec. 4.1 to optimize the free parameters of NeSi using only 200 labels in
total, while keeping a fully blind test set. The parameters are optimized with respect to the more common
20-class problem, and we then keep the same parameter setting also for the easier 6-class task. We allowed
a training time of 200 iterations over the whole training set and restricted the parameters in the grid search
such that the networks converged within this limit.


Hidden Units. Following the above tuning protocol for 20 Newsgroups (20 classes) results in a best
performing architecture of D-C-K = 61188-20-20, that is, the complete setting C = K = 20. Generally
we would expect that the overcomplete setting C > K allows for more expressive representations.
This is indeed the case for the 6-class problem (K = 6), for which we find that C = 20 (61188-20-6)
is the best setting, but more middle-layer classes were not beneficial for the 20-class problem. Using
more than 20 middle layer units (C > 20) for the K = 20 problem could be hindered here by the high
dimensionality of the data relative to the number of available training data points as well as by the prominent
noise when taking all words of a given document into account.


Normalization. Because of the introduced background value of +1 (see Eq. T1.3), the normalization
constant A has a lower bound in the dimensionality of the input data, D = 61 188. For very low values
$A \gtrsim D$, the model is unable to differentiate the observed patterns from background noise. At the other
extreme, for $A \to \infty$, the softmax function converges to a winner-take-all maximum function. The
optimal value lies in between, shortly after the system is able to differentiate all classes from background
noise but when the normalization is still low enough to allow for a broad softmax response. For all our
experiments on the 20 Newsgroups data set we chose (following the tuning protocol) A = 80 000 (that is,
$A/D \approx 1.31$).
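
The lower bound A > D can be made concrete with a small sketch. Eq. T1.3 is not reproduced in this section, so the form below is an assumption that is merely consistent with the text: the input is rescaled to sum to A − D and a background value of +1 is added to each of the D entries, so that the result sums to A. The function name `normalize_input` is hypothetical.

```python
def normalize_input(y, A):
    """Rescale a non-negative vector to sum to A with a +1 background.

    Assumed form: y_norm[d] = (A - D) * y[d] / sum(y) + 1.
    """
    D = len(y)
    if A <= D:
        raise ValueError("A must exceed the input dimensionality D")
    total = sum(y)
    if total <= 0:
        raise ValueError("input must contain at least one positive entry")
    return [(A - D) * v / total + 1.0 for v in y]


# Toy input (hypothetical): D = 4, A = 6.
y = [0.0, 3.0, 1.0, 0.0]
y_norm = normalize_input(y, A=6.0)

# Ratio A/D for the 20 Newsgroups setting quoted in the text.
ratio_20ng = 80000 / 61188
```

In this form, A − D is the mass that carries the pattern; as A approaches D, every entry approaches the background value, which is one way to read why patterns become indistinguishable from background noise.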


Learning Rates. A relatively high learning rate in the first hidden layer ($\epsilon_W = 5 \times C/N$), coupled with
a much lower learning rate in the second hidden layer ($\epsilon_R = 0.5 \times K/L$), yielded the best results on
the validation set. Especially the high value for $\epsilon_W$ seems to have the effect of more efficiently avoiding
shallow local optima, which exist, again, due to noise and the high dimensionality of the data compared to
the relatively low number of training samples. The different learning rates $\epsilon_W$ and $\epsilon_R$ mean that the
neural network follows a gradient markedly different from an EM update. This suggests that the neural
network allows for improved learning compared to the EM updates it was derived from.

Note that in practice we use normalized learning rates. The factor C/N for the first hidden layer and
K/L for the second hidden layer represents the average activation per hidden unit over one full iteration
over a data set of N data points with L labels. Tuning not the absolute learning rate but the proportionality
to this average activation helps to decouple the optimum of the learning rates from the network size (C
and K) and the amount of available training data and labels (N and L).
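
As a worked example of this normalization, the following sketch converts the tuned relative rates into absolute learning rates for the 20 Newsgroups tuning setting (C = K = 20, N = 11 269, L = 200). The function name `normalized_learning_rates` is hypothetical.

```python
def normalized_learning_rates(rel_w, rel_r, C, K, N, L):
    """Convert relative learning rates into absolute ones.

    rel_w, rel_r: tuned proportionality factors (5 and 0.5 in the text).
    The factors C/N and K/L are applied as described in the text.
    """
    eps_w = rel_w * C / N  # first hidden layer
    eps_r = rel_r * K / L  # second hidden layer
    return eps_w, eps_r


# 20 Newsgroups tuning setting: C = K = 20, N = 11269, L = 200.
eps_w, eps_r = normalized_learning_rates(5.0, 0.5, C=20, K=20, N=11269, L=200)
```

Here `eps_w` is about 0.0089 and `eps_r` is 0.05. Changing C, K, N or L rescales the absolute rates while the tuned factors 5 and 0.5 stay the same.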

BvSB Threshold. Given the optimized values of the other free parameters, we found that introducing
the additional self-labeling for unlabeled data is not helpful and even harmful for the 20 Newsgroups
data set. Since even in the settings with only very few labeled data points, the number of provided labels
per middle layer hidden unit is already sufficiently large, the usage of inferred labels only introduces
destructive noise. The self-labeling will prove to be more useful in scenarios where the number of hidden
units greatly surpasses the number of available labeled data points (as for MNIST, sec. 4.3, and NIST,
sec. 4.4).

4.2.2 Results on 20 Newsgroups (6 classes) 

We start with the easier task of subject matter classification, where the twenty newsgroup topics are
partitioned into six higher-level groups that combine related topics (‘comp’, ‘rec’, etc.). The optimal
architecture for 20 Newsgroups (20 classes) on the validation set was given in the complete setting, where
C = K = 20. At first glance, this seems to suggest that no subclasses were learned and that the split in the middle
layer was primarily guided by class labels. However, also for classification of subject matters (6 classes),
where only labels of the six higher-level topics were given, we observed the setting with C = 20 units
(61188-20-6) to be far superior to the complete setting with architecture 61188-6-6 (see tab. 2). This
suggests that the data structure of 20 subclasses determines the optimal architecture of the NeSi network
and not the number of label classes (see also secs. 4.3 and 4.4). In our experiments we furthermore
observed that the feedforward network, which learns completely unsupervised in the middle layer, still
achieves a performance similar to that of the recurrent r-NeSi network. This shows that the NeSi networks are
able to recover individual subclasses of the newsgroups data independently of the label information.


             ff-NeSi                           r-NeSi
#labels      C = 6, K = 6     C = 20, K = 6    C = 6, K = 6     C = 20, K = 6
200          41.66 ± 1.21     14.23 ± 0.45     39.02 ± 1.49     14.21 ± 0.42
800          40.41 ± 1.31     14.04 ± 0.48     39.54 ± 1.64     14.58 ± 0.75
2000         42.31 ± 0.72     14.26 ± 0.47     40.05 ± 0.64     13.44 ± 0.43
11269        41.85 ± 0.90     14.95 ± 0.73     36.56 ± 2.09     13.26 ± 0.35


Table 2: Test error for 10 independent runs on the 20 Newsgroups data set, when classes are combined by their corresponding 
subject matters (classification into K = 6 classes). Here the overcomplete setting (C > K) shows best results, where the 
network is able to learn the 20 individual subclasses present in the data. 


4.2.3 Results on 20 Newsgroups (20 classes) 

We now continue with the more challenging 20-class problem (K = 20). Here, we investigate semi- 
supervised settings of 20, 40, 200, 800 and 2000 labels in total—that is 1, 2, 10, 40 and 100 labels per 
class—as well as the fully labeled setting. For each setting, we present the mean test error averaged over 
100 independent runs and the standard error of the mean (SEM). On each new run, a new set of class 
balanced labels is chosen randomly from the training set. We train our model on the full 20-class problem 
without any feature selection. An example of some learned weights of r-NeSi is shown in fig. 4. 
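
For reference, the reported statistics (mean test error and standard error of the mean over independent runs) can be computed as in the following sketch. It assumes the SEM is the sample standard deviation divided by the square root of the number of runs; the function name and the toy error values are hypothetical.

```python
import math


def mean_and_sem(errors):
    """Return the mean and the standard error of the mean (SEM)."""
    n = len(errors)
    mean = sum(errors) / n
    # unbiased sample variance (divides by n - 1)
    var = sum((e - mean) ** 2 for e in errors) / (n - 1)
    return mean, math.sqrt(var / n)


# Toy test errors from five runs (made-up numbers, not from the paper).
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, sem = mean_and_sem(runs)
```

Because the SEM shrinks with the square root of the number of runs, values obtained from 100 runs are not directly comparable to standard deviations reported elsewhere.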

To the best of our knowledge, most methods that report performance on the same benchmark consider
easier tasks: They either break the task into binary classification between individual or merged topics [such
as Cheng et al., 2006, Kim et al., 2014, Wang and Manning, 2012, Zhu et al., 2003], and/or perform feature
selection [for instance, Srivastava et al., 2013, Settles, 2011] for classification. There are however works

Figure 4: Example of learned weights by the r-NeSi algorithm in the semi-supervised setting of 800 labels. Shown are the
20 features with the highest learned tf-idf occurrence frequency for each of the 20 hidden units as bar plot (scaled relative to
the most likely feature). Columns next to each field show the corresponding learned class assignment. Each field is labeled by
the class k with the highest probability $p(k|c)$ for that field c. For that most likely class, the probabilities $p(k|c)$ and $p(c|k)$ are
given.


that are compatible with our experimental setup [Larochelle and Bengio, 2008, Ranzato and Szummer, 
2008]. A hybrid of generative and discriminative RBMs (HDRBM) trained by Larochelle and Bengio 
[2008] uses stochastic gradient descent to perform semi-supervised learning. They report results on 
20 Newsgroups for both supervised and semi-supervised setups. In the fully labeled setting, all their 
model- and hyperparameters are optimized using a validation set of 1691 examples with the remaining 
9578 in the training set. In the semi-supervised setup 200 examples were used as validation set with 800 
labeled examples in the training set. To reduce the dimensionality of the input data, they only used the 
5000 most frequent words. The classification accuracy of the method is compared in tab. 3. 

Here, the recurrent and feedforward networks produce very similar results, with a small advantage for
the recurrent networks. This small advantage could however also be explained by a bias in our tuning
procedure, where the parameters are specifically optimized for the recurrent model. In comparison with
HDRBM, ff-NeSi and r-NeSi both achieve better results than the competing model for the semi-supervised
setting. Both algorithms are still better with down to 200 labels, even though HDRBM uses more labels
for training and additional labels for parameter tuning. Performance only decreases significantly
when going down even further to only one or two labels per class for training (note that the parameters
were actually tuned using 200 labels in total).

#labels      ff-NeSi              r-NeSi               HDRBM
20           70.64 ± 0.68 (*)     68.68 ± 0.77 (*)     -
40           55.67 ± 0.54 (*)     54.24 ± 0.66 (*)     -
200          30.59 ± 0.22         29.28 ± 0.21         -
800          28.26 ± 0.10         27.20 ± 0.07         31.8 (*)
2000         27.87 ± 0.07         27.15 ± 0.07         -
11269        28.08 ± 0.08         27.28 ± 0.07         23.8


Table 3: Test error on 20 Newsgroups for different label-settings using the feedforward and the recurrent Neural Simpletrons.
We differentiate here between settings with different amounts of labels available during training. For results marked with (*), the
free parameters of the model were optimized using additional labels; NeSi used the same parameter setting in all experiments on
20 Newsgroups, which was tuned with 200 labels in total; HDRBM used 1000 labels in total for tuning in the semi-supervised
setting.


4.2.4 Optimization in the Fully Labeled Setting 

In the fully labeled setting, the HDRBM significantly outperforms the NeSi approaches shown. However,
we have so far used one parameter tuning fixed for all settings. We can further optimize for a specific
setting, here the fully labeled one. In that setting, we can still gain a larger benefit from the recurrence
of r-NeSi: Changing its initialization procedure from $R_{kc} = 1/C$ to $R_{kc} = \delta_{kc}$ helped to avoid shallow
local optima and reached a test error of (17.85 ± 0.01)%. This initialization fixes the class k of subclass
c to a single specific class by setting all connections between the first and second hidden layer to other
classes to zero. Training with such a weight initialization is however only useful when very large amounts
of labeled data are available. The top-down label information is then an important mechanism to make
sure that the middle layer units learn the appropriate representation of their respective fixed class (for
example, that a middle layer unit that is fixed to class ‘alt.atheism’ mainly, or exclusively, learns from data
belonging to that class). So, instead of first learning representations in the middle layer purely from the
data and then learning the classes with respect to these representations from the labels, like the (greedy)
ff-NeSi, the r-NeSi algorithm is also able to conversely shape its middle layer representations in relation
to their probability of belonging to the class of the presented data point.

To decide between this initialization procedure in the fully labeled setting and our standard one, we here
used the fully labeled training set during parameter tuning (again with a half/half split into training and
validation set). With the better avoidance of shallow optima by this initialization, lower learning rates
$\epsilon_W$ were now more beneficial ($\epsilon_R$ drops out as a free parameter, as the top layer remains fixed). A coarse
manual grid search in this setting resulted in optimal parameter values at A = 90 000 ($A/D \approx 1.47$) and
$\epsilon_W = 0.02$ (which we chose as the lowest search value to restrict computation time), while keeping C = 20.
These results also show that parameter optimization based on each individual label setting (instead of
just on the most weakly labeled setting) and changing the initialization procedure based on label availability
could potentially lead to better parameter settings and stronger performance also in the other settings.


4.3 Handwritten Digit Recognition (MNIST) 

The MNIST data set consists of 60 000 training and 10 000 testing data points of 28 × 28 images of gray-scale
handwritten digits which are centered by pixel mass. We perform experiments in the semi-supervised
setting using 10, 100, 600, 1000 and 3000 labels in total, which are chosen randomly and class balanced
from the 10 classes. Additionally, we consider the setting of a fully labeled training set.

4.3.1 Parameter Tuning on MNIST 

We here give a short overview of the parameter tuning on the MNIST data set. We again use the tuning
procedure described in sec. 4.1 to optimize all free parameters of NeSi using only 100 labels in total from
the training data, keeping a fully blind test set. We allowed a training time of 500 iterations over the
whole training set and restricted the parameters in the grid search such that the networks converged within
this limit.


Hidden Units. Contrary to the 20 Newsgroups data set, for MNIST the validation error generally
decreased with an increasing number of hidden units. We therefore used C = 10 000 for all our
experiments for both the feedforward and the recurrent networks, which we set as upper limit for network
size as a good trade-off between performance and required compute time. However, with so many
hidden units on a training set of 60 000 data points, and with as few as 10 labeled training samples in
total, overfitting effects have to be taken into consideration. We discuss these in more depth in secs. 4.3.2
and 4.3.4. In general, we encountered an increase in error rates with prolonged training times only for the
r-NeSi algorithm in the semi-supervised settings when no self-labeling was used. For this case only, we
devised and used a stopping criterion based on the likelihood of the training data.


Normalization. The dependence of the validation error on the normalization constant A shows similar
behavior as for the 20 Newsgroups data set. Following a screening according to the tuning protocol, the
setting of A = 900 (that is, $A/D \approx 1.15$) was chosen.

Learning Rates. While a high learning rate can be used to overcome shallow local optima, a lower learning
rate will in general yield better results with the downside of a longer training time until convergence.
As trade-off between performance and training time, we chose $\epsilon_W = 0.2 \times C/N$ and $\epsilon_R = 0.2 \times K/L$ for
all experiments on MNIST. Since for networks using self-labeling the number of effectively used labels L
approaches N over time, we scale the learning rate for those systems with K/N instead of K/L, that is,
$\epsilon_R = 0.2 \times K/N$ for r+- and ff+-NeSi.

BvSB Threshold. With C = 10 000 and only 50 labels in total in the training set during parameter
tuning, there is only a single label per 200 middle layer fields available to learn their respective classes. In
this setting, using self-labeling on unlabeled data as described in sec. 3.3 decreased the validation error
significantly over the whole tested regime of $\vartheta \in [0.1, 0.2, \ldots, 0.9]$. We chose $\vartheta = 0.6$ as the optimal
value.
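
To illustrate how such a threshold can act, the following is a minimal sketch of a best-versus-second-best (BvSB) self-labeling rule. Sec. 3.3 is not reproduced here, so the concrete criterion is an assumption: the margin is taken as the difference between the two largest class posteriors, and an unlabeled point receives an inferred label only if this margin exceeds the threshold. The function name `bvsb_self_label` is hypothetical.

```python
def bvsb_self_label(posterior, threshold):
    """Return the inferred class index, or None if the point stays unlabeled.

    posterior: list of class probabilities for one unlabeled data point.
    Assumed criterion: (best - second best) > threshold.
    """
    ranked = sorted(range(len(posterior)),
                    key=lambda k: posterior[k], reverse=True)
    best, second = ranked[0], ranked[1]
    margin = posterior[best] - posterior[second]
    return best if margin > threshold else None


# Toy posteriors (hypothetical) evaluated at the chosen threshold of 0.6.
confident = bvsb_self_label([0.05, 0.85, 0.10], threshold=0.6)
ambiguous = bvsb_self_label([0.50, 0.40, 0.10], threshold=0.6)
```

Under this assumed rule, a higher threshold admits fewer but more reliable inferred labels, which is the trade-off that the tested range of thresholds explores.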


4.3.2 A Likelihood Criterion For Early Stopping 

Training of the first layer in the feedforward network is not influenced by the state of the second layer,
and is therefore independent of the number of provided labels. This is no longer the case for the recurrent
network. A low number of labels can lead to overfitting effects in r-NeSi when the number of hidden units
in the first hidden layer is substantially larger than the number of labeled data points. However, when
using the inferred labels for training in the r+-NeSi network, such overfitting effects vanish again.

Since learning in our network corresponds to maximum likelihood learning in a hierarchical generative
model, a natural criterion for early stopping can be based on monitoring the
log-likelihood, which is given by eq. (5) (replacing the generative weights $(\mathcal{W}, \mathcal{R})$ by the weights $(W, R)$
of the network). As soon as the scarce labeled data starts overfitting the first layer units as a result of the
top-down influence (compare Eq. T1.5), the log-likelihood computed over the whole training data is
observed to decrease. This decline in data likelihood can be used as a stopping criterion to avoid
overfitting without requiring additional labels.

Fig. 5 shows an example of the evolution of the average log-likelihood per data point during training
compared to the test error. For experiments over a variety of network sizes, we found strong negative
correlations between the two (PPMCC = −0.85 ± 0.1). To smooth out random fluctuations in the likelihood,
we compute the centered moving average over 20 iterations and stop as soon as this value drops below its
maximum value by more than the centered moving standard deviation. The test error in fig. 5 is only
computed for illustration purposes. In our experiments we solely used the moving average of the likelihood
to detect the drop event and stop learning. In our control experiments on MNIST, we found that the best
test error generally occurred some iterations after the peak in the likelihood (compare fig. 5), which, for
simplicity, we did not exploit for our reported results.
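
The stopping rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the reported results: the moving statistics are computed over the most recent 20 values (which equals the centered window of the iteration 10 steps earlier), the standard deviation is the population form, and the function name and toy series are hypothetical.

```python
import math


def stopping_iteration(loglik, window=20):
    """Return the iteration at which the stopping rule fires, else None.

    loglik: average log-likelihood per data point, one value per iteration.
    Fires when the moving average falls below its running maximum by more
    than the moving standard deviation of the same window. The returned
    index is the last iteration contained in the triggering window.
    """
    best_avg = -math.inf
    for end in range(window, len(loglik) + 1):
        chunk = loglik[end - window:end]
        avg = sum(chunk) / window
        std = math.sqrt(sum((v - avg) ** 2 for v in chunk) / window)
        best_avg = max(best_avg, avg)
        if best_avg - avg > std:
            return end - 1
    return None


# Toy series (hypothetical): rises for 50 iterations, then declines.
rising = [float(t) for t in range(50)]
falling = [49.0 - j for j in range(1, 51)]
stop = stopping_iteration(rising + falling)
never = stopping_iteration(rising + [50.0 + j for j in range(50)])
```

Because a centered window needs values from both sides, the drop can only be detected some iterations after it starts; in this toy series the rule fires after the peak at iteration 49 and never fires for the monotonically increasing series.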



[Figure 5 plot; x-axis: Iteration.]

Figure 5: Evolution of test error (solid) and log-likelihood
(dashed) in r-NeSi during training. Both show a strong negative
correlation. The vertical line denotes the stopping point.


4.3.3 Results on MNIST 

Tab. 4 shows the results of the NeSi algorithms on the MNIST benchmark. As the NeSi model has no
prior knowledge about spatial relations in the data, the given results are invariant to pixel permutation. As
can be observed, the recurrent networks (r-NeSi) result in significantly lower classification errors than the
feedforward networks (ff-NeSi) in the fully and the most weakly labeled settings. Between those extremes,
we find a regime where the feedforward networks not only catch up with the recurrent networks but
even perform slightly better. In this highly overcomplete setting, we now also see a significant gain in
performance for the semi-supervised settings with the additional self-labeling (ff+-NeSi and r+-NeSi).
With these additional inferred labels, the feedforward network surpasses the recurrent version also in the
settings with very few labels, down to a single label per class. For this last setting, however, we had to
increase the training time to 2000 iterations to ensure convergence, since learning in the top layer with a
single label per class per iteration is very slow when the learning rate is not adjusted.

Fig. 6 shows a comparison to standard and recent state-of-the-art approaches. The NeSi networks are
competitive, outperforming deep belief networks (‘DBN-rNCA’) and other recent approaches (like the
‘Embed’-networks, ‘AGR’ and ‘AtlasRBF’). In the light of reduced model complexity and effectively used
labels, we can furthermore compare to the few very recent algorithms with a lower error rate (‘M1+M2’,
‘VAT’ and the ‘Ladder’-networks).

For the comparison in fig. 6, we have to point out that (for lack of more comparable findings) all other
algorithms actually report results for a markedly different (and generally easier) task than we
do. All of these models use for parameter tuning either the test set or a validation set with a substantial
number of labels in addition to those available during training. Also, some of the algorithms (namely
the TSVM, AGR, AtlasRBF and the Em-networks) actually train in the transductive setting, where the
(unlabeled) test data is included in the training process. For the NeSi approaches however, we avoided
[Figure 6 plots. Top panels: 100 labels, 600 labels, 1,000 labels and 3,000 labels; models are grouped by the additional labels used for tuning (+10,000 labels, +1,000 labels).]



Figure 6: Comparison of different algorithms on MNIST data with few labels. The top figure shows results for systems using
100, 600, 1000, and 3000 labeled data points for training. The algorithms are described in detail in the corresponding papers:
¹Salakhutdinov and Hinton [2007], ²Liu et al. [2010], ³Weston et al. [2012], ⁴Pitelis et al. [2014], ⁵Kingma et al. [2014],
⁶Rasmus et al. [2015], ⁷Miyato et al. [2015]. All algorithms except ours use 1000 or 10 000 additional data labels (from the
training or test set) for parameter tuning. The bottom figure gives the number of tunable parameters (as estimated in tab. D.1)
and, where known, learned parameters of the algorithms (note the different scales).




Figure 7: Classification performance of different algorithms compared against varying proportion of labeled training data. The
corresponding papers are listed in fig. 6. The left-hand-side plot shows the achieved test errors w.r.t. the amount of labeled data
seen by the compared algorithms during training. The right-hand-side plot illustrates for the same experiments the total amount
of labeled data seen by each of the algorithms over the whole tuning and training procedure. For better readability, we only show
the recurrent NeSi networks in the left-hand-side plot. Results for the feedforward networks can be directly transferred from the
right-hand side. The plots can be read similarly to ROC curves, in that the more a curve approaches the upper-left corner,
the better the performance of a system for decreasing amounts of available labeled data.

#labels      ff-NeSi              r-NeSi               ff+-NeSi             r+-NeSi
10           55.46 ± 0.57 (*)     29.61 ± 0.57 (*)     10.91 ± 0.86 (*)     17.90 ± 0.89 (*)
100          19.08 ± 0.26         12.43 ± 0.15         4.96 ± 0.08          4.93 ± 0.05
600          7.27 ± 0.05          6.94 ± 0.05          4.08 ± 0.02          4.34 ± 0.01
1000         5.88 ± 0.03          6.07 ± 0.03          4.00 ± 0.01          4.26 ± 0.01
3000         4.39 ± 0.02          4.68 ± 0.02          3.85 ± 0.01          4.05 ± 0.01
60000        3.27 ± 0.01          2.94 ± 0.01          3.27 ± 0.01          2.94 ± 0.01


Table 4: Test error on permutation invariant MNIST for different semi-supervised settings using the feedforward and the recurrent
Neural Simpletrons with and without self-labeling. We differentiate here between settings with different amounts of labels
available during training. For results marked with (*), the free parameters were optimized using additional labels. We used the
same parameter setting for all experiments shown here, which was tuned using 100 labels in total. The results are given as the
mean and standard error (SEM) over 100 independent repetitions, with randomly drawn, class-balanced labels. In the fully
labeled case, there are no unlabeled data points to use self-labeling on. Therefore the results of ff- and ff+-NeSi are identical
there, as well as those of r- and r+-NeSi.


any training or tuning on the test set or on additional labeled data. This also prevents the risk of overfitting
to the test set. The more complex a system is, the more labels are generally necessary to find optimal
parameter settings that are not overfitted to a small validation set and that do not generalize poorly. When using test
data during parameter tuning, the danger of such overfitting is even more severe, as overfitting effects
could be mistaken for good generalizability. Therefore, in fig. 6 we grouped the models by the amount of
additional labeled data points used in the validation set for parameter tuning and also show the number
of free parameters for each algorithm, as far as we were able to estimate from the corresponding papers.
These numbers of course have to be treated with great care, as not all parameters can be treated equally. For
some tunable parameters, for example, a default value may already always give good results, while others
might have to be highly optimized for each new task. Thus, these numbers should be taken more as an
index of model complexity.

Fig. 7 shows the performance of the models with respect to the number of labels used during training
(left-hand side) and with respect to the total number of labels used for the complete tuning and training
procedure (right-hand side). For the NeSi algorithms, these plots are identical, as we use at most
as many labels in the tuning phase as in the training phase for the shown results. For all other algorithms,
however, these plots can be regarded as the two extreme cases, with their actual performance in our
chosen setting probably lying somewhere in between.


4.3.4 Overfitting Control for NeSi 

With a network of 10 000 hidden units which learns on 60 000 training samples, some of the hidden
units adapt to represent more rarely seen patterns while others adapt to represent patterns that are more
frequent in the training data. Furthermore, the network learns the frequency at which patterns occur as the
distribution $p(c|R) = \frac{1}{K} \sum_k R_{kc}$. Fig. 8 displays a random selection of 100 out of the 10 000 fields after
training using the r+-NeSi algorithm:

Fields colored in blue in fig. 8 have a very low probability of $p(c|R) \cdot N < 0.5$, with $p(c|R)$ for most of them
being close to zero. These fields have ceased to further specialize to respective pattern classes because
of sufficiently many other fields that have optimized for a class. They are effectively discarded by the
network itself, as the low values in $R_{kc}$ further suppress the activation of those fields in the recurrent
network. With longer training times, $p(c|R)$ of those fields converges to zero, which practically prunes
the network to the remaining size. The red fields in fig. 8 have a probability of $0.5 < p(c|R) \cdot N < 1.5$ to
Figure 8: A subset of converged weights learned by the r+-NeSi
algorithm in the setting of 100 labels. The squared
fields are 100 of the 10 000 learned weights $W_c$ with their
learned class belonging as columns next to the fields
(starting with class ‘0’ at the top to class ‘9’ at the bottom
of each column). Blue fields have $p(c|R) \cdot N < 0.5$.
Those are ‘forgotten fields’ of the network whose connections
are too weak for further specialization. The red fields
have $0.5 < p(c|R) \cdot N < 1.5$. Those are fields that highly
specialized to a single pattern in the training set.
be activated, which corresponds to approximately one data point in the training set that activates the field.
Such weights are often adapted to one single training data point with a very uncommon writing style (like
the crooked ‘7’ in 4th column, 9th row) or some kind of preprocessing artifact (like the cropped ‘3’ in 2nd
column, 7th row).

We did control for the effect of the rarely active fields (blue and red in fig. 8), especially as some of the
fields are clearly overfitted to the training set. For that, we compared an original network of 10 000 fields
(that is, 10 000 middle layer neurons) with a network for which all fields with activity $p(c|R) \cdot N < 1.5$
were removed (which was around 15% of the 10 000 fields). We observed no significant changes in the
test error between the original and the pruned network. The reason is that the pruned fields are essentially
never activated at test time because of low similarities to test data and strong suppression by the network
itself (due to the learned low activation rates during training).
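
The pruning control can be sketched as follows. The sketch assumes that the field probability is the average of the top-layer weights over the K classes, p(c|R) = (1/K) Σ_k R_kc, and that each row of R sums to one over the fields; the function name `prune_fields` and the toy weights are hypothetical.

```python
def prune_fields(R, N, min_activations=1.5):
    """Return the indices of the middle layer fields that are kept.

    R[k][c]: top-layer weight between class k and field c, where each
    row is assumed to sum to 1 over the fields c.
    A field is removed if its expected number of activating training
    points, p(c|R) * N, is below `min_activations`.
    """
    K = len(R)
    C = len(R[0])
    keep = []
    for c in range(C):
        p_c = sum(R[k][c] for k in range(K)) / K  # assumed p(c|R)
        if p_c * N >= min_activations:
            keep.append(c)
    return keep


# Toy weights (hypothetical): K = 2 classes, C = 4 fields, N = 100 points.
R = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.99, 0.01]]
kept = prune_fields(R, N=100)
```

In this toy example the last field is expected to be activated by only 0.5 training points and is removed, while the other three fields are kept.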


4.4 Large Scale Handwriting Recognition (NIST SD19) 

Modern algorithms, especially in the field of semi-supervised learning, should be able to handle and
benefit from the ever increasing amounts of available data (‘big data’). A comparable task to MNIST, but
with many more data points and much higher input dimensionality, is given by the NIST Special Database
19. It contains over 800 000 binary 128 × 128 images from 3600 different writers (with around half of
the data being handwritten digits and the other half being lower and upper case letters). We perform
experiments on both digit recognition (10 classes) and case-sensitive letter recognition (52 classes). We
first applied the NeSi networks to the unpreprocessed NIST SD19 digit data with D = 16 384 input
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Figure 9: Visualization of learned weights of middle layer units of a ff+-NeSi network when trained in the semi-supervised 
setting for NIST hand-written letters data using 520 labels (10 per class). 
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4.4 Large Scale Handwriting Recognition (NIST SD19) 


#labels/class          1               10              60              100             300             fully labeled

digits (10 classes)
#labels total          10              100             600             1000            3000            344307
ff+-NeSi               7.56 ± 1.79     6.20 ± 0.16     6.02 ± 0.08     6.02 ± 0.12     5.70 ± 0.03     5.11 ± 0.01
r+-NeSi                9.84 ± 2.40     6.14 ± 0.23     5.83 ± 0.14     5.94 ± 0.12     5.72 ± 0.10     4.52 ± 0.01
35C-MCDNN              -               -               -               -               -               0.77

letters (52 classes)
#labels total          52              520             3120            5200            15600           387361
ff+-NeSi               55.70 ± 0.62    46.22 ± 0.43    44.24 ± 0.23    43.69 ± 0.21    42.96 ± 0.28    34.66 ± 0.05
r+-NeSi                64.97 ± 0.85    54.08 ± 0.38    43.73 ± 0.15    41.57 ± 0.13    37.95 ± 0.12    31.93 ± 0.06
35C-MCDNN

Table 5: Test error on NIST SD19 data set on the task of digit and letter recognition for different total amounts of labeled data.
The results for NeSi are permutation invariant and given as the mean and standard error (SEM) over 10 independent repetitions,
with randomly drawn, class-balanced labels.


pixels. The data is of much higher dimensionality than MNIST and the patterns are not centered by pixel
mass, which makes the task significantly more challenging, as a lot more uninformative variation is kept
within the data. Hence, a mixture model would need many more hidden units to learn these variations and
achieve similar performance. When keeping the same parameter setting as for MNIST (where we
only increased A to 25 000, giving A/D ≈ 1.5, to account for the increased input dimensionality), the
best performance for digit data in the fully labeled case was achieved by the r-NeSi network with an error
rate of 9.5%.

For better performance and easier comparison, we preprocessed the data similarly to MNIST [compare
Cireşan et al., 2012]: for each image, we calculate square bounding boxes, resize to 20 x 20, zero-pad
to 28 x 28 and center by pixel mass. Finally, we invert the image, such that the patterns rather than the
background have high pixel values, as is the case for MNIST. For simplicity’s sake and because of the high
similarity of the data, we then use the same setting for our free model parameters as we used for MNIST without
further retuning. The experiments are done using 1, 10, 60, 100, 300, or all labels per class. We allowed
for the same number of iterations as for MNIST to give sufficient training time for convergence. However,
with roughly five times more training data than for MNIST but the same total amount of labels, we now
have a five times lower average activation in the top layer until self-labeling starts. In the semi-supervised
settings, we therefore scale the learning rate of the top layer also by a factor of five compared to MNIST
to ε_R = 1 x K/N for comparable convergence times. Fig. 9 shows some examples of learned weights
by the ff+-NeSi network with 10 labels per class. In tab. 5, we report the mean and standard error
over 10 experiments on both digit and letter data. For the NeSi networks, the results are given for the
permutation invariant task. To the best of our knowledge, this is the first system to report results for
NIST SD19 in the semi-supervised setting.
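The preprocessing steps described above (bounding box, resize, zero-pad, centering, inversion) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input is assumed to be a binary image given as a list of rows with ink = 0 and background = 1, the resize uses nearest-neighbor sampling (the interpolation method is not stated in the text), and the inversion is applied first rather than last so that zero-padding matches the background. The function name `preprocess` is hypothetical.

```python
def preprocess(image, box=20, size=28):
    """MNIST-style preprocessing sketch for one binary image (list of rows).

    Assumes ink = 0 and background = 1 in the input and at least one ink pixel.
    """
    # Invert so that ink has high values and the background is zero.
    img = [[1 - v for v in row] for row in image]

    # Square bounding box around all ink pixels.
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    side = max(height, width)
    top = rows[0] - (side - height) // 2
    left = cols[0] - (side - width) // 2

    def pixel(r, c):
        inside = 0 <= r < len(img) and 0 <= c < len(img[0])
        return img[r][c] if inside else 0

    # Nearest-neighbor resize of the square crop to box x box.
    small = [[pixel(top + (i * side) // box, left + (j * side) // box)
              for j in range(box)] for i in range(box)]

    # Center of pixel mass of the resized pattern.
    mass = sum(map(sum, small))
    cy = sum(i * v for i, row in enumerate(small) for v in row) / mass
    cx = sum(j * v for row in small for j, v in enumerate(row)) / mass

    # Zero-pad to size x size with the center of mass at the image center.
    off_y = round((size - 1) / 2 - cy)
    off_x = round((size - 1) / 2 - cx)
    out = [[0] * size for _ in range(size)]
    for i, row in enumerate(small):
        for j, v in enumerate(row):
            if 0 <= i + off_y < size and 0 <= j + off_x < size:
                out[i + off_y][j + off_x] = v
    return out
```

In practice an image library with proper interpolation would replace the nearest-neighbor step; the sketch only fixes the order of operations.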

As for MNIST, the performance of our 3-layer network in the fully labeled setting is not competitive with
state-of-the-art fully supervised algorithms [like the 35C-MCDNN, a committee of 35 deep convolutional
neural networks, Cireşan et al., 2012]. Note, however, that our results apply to the
permutation invariant setting and do not take prior knowledge about two-dimensional image data into
account (as convolutional networks do). More importantly, for the settings with few labels, we only see a
relatively mild increase in test error when we strongly decrease the total number of used labels. Even for
just ten labels per class most patterns are correctly classified for the challenging task of case-sensitive
letter classification (chance is below 2%). Comparison of the digit classification setting with MNIST
furthermore suggests that not the relative but the absolute number of labels per class is important for
learning in our networks [compare, for example, Rasmus et al., 2015, footnote 4].

In general, digit classification with NIST SD19 seems to be a more challenging task than MNIST [which
can also be observed in the results of Cireşan et al., 2012]. However, the test error in our case increased
more slowly than for MNIST with decreasing numbers of labels, and in the extreme case of a single label
per class even surpassed the MNIST results. When using, as for MNIST, only 60 000 training examples
for NIST, the test error for the single-label setting on digit data increased from (7.56 ± 1.79)% to
(9.10 ± 0.92)% for ff+-NeSi, showing the benefit of additional unlabeled data points. The feedforward
network with self-labeling is best at keeping a low test error with very few labels. In fact, the main
reason for the increase in test error for the single-label case is rare outliers, where two or more classes
were learned completely switched, for example, all ‘3’s were learned as ‘8’s and vice versa. This can
happen when the single randomly chosen labeled data points of two similar classes are too ambiguous
and therefore lie close together at the border between two clusters. As a result, most networks within
the 10 runs had test errors between 5.5% and 7%, with one outlier at over 20% (see appendix C). It
seems that additional unlabeled data points lead to better defined clusters, where this problem occurs
less frequently. Since in the recurrent network the label information is also fed back to the middle layer,
this network is more sensitive to label information. On the one hand, this helps when more label information
is known. On the other hand, this also more often results in a stronger accumulation of errors in the
self-labeling procedure as wrong labels are less frequently corrected.

With more training data available than for MNIST, we also tried out bigger networks of 20 000 hidden
units for digit data, but only saw slight improvements in the test error. This points to a limit of learnable
subclasses (that is, writing styles) within the data: modeling more than C = 10 000 subclasses
improves performance only very little, whereas the increased amount of data in NIST helps to better define
the given subclasses.


4.5 Comparison to Bio-Inspired Neural Networks for Neuromorphic Hardware 

In addition to systems optimized for functional performance on standard CPU and GPU hardware, another
line of research investigates learning systems that are well-suited for execution on alternative hardware
such as analog VLSI circuits. Most such systems are based on spiking neuron models and neurally
plausible learning rules such as spike-timing-dependent plasticity [STDP; Gerstner et al., 1996, Bi and
Poo, 2001]. A major advantage of learning algorithms implemented on analog VLSI chips is their time
and energy efficiency compared to conventional hardware. These features could make
analog VLSI chips, which are in this context often referred to as neuromorphic chips, a highly
promising new hardware technology.

Architecture and task domain of bio-inspired learning systems share properties with deep neural networks 
and the simpletron systems discussed here, which makes comparison interesting. We compare here to 
three recent versions of spiking neural networks that learn unsupervised on data. Notably, also for this 
research domain MNIST is used as a major tool for evaluation, which facilitates comparison. We adapt 
the NeSi networks to relate more closely to the respective systems we compare to. Except for the network 
size C, we keep all free parameters at the optimized setting of sec. 4.3 and report test errors as the mean 
and standard error (SEM) over 10 independent training runs. 
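For reference, the mean and standard error reported over independent runs can be computed as sketched below. The sketch assumes the sample standard deviation (n - 1 in the denominator); the text does not state which form is used, and the function name `mean_and_sem` is hypothetical.

```python
import math
import statistics


def mean_and_sem(errors):
    """Mean and standard error of the mean over independent training runs.

    Uses the sample standard deviation (n - 1 in the denominator), which is
    an assumption of this sketch.
    """
    n = len(errors)
    mean = statistics.mean(errors)
    sem = statistics.stdev(errors) / math.sqrt(n)
    return mean, sem
```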

While bio-inspired systems are increasingly often realized on neuromorphic hardware [for example,
Schmuker et al., 2014], the results of the systems we compared to were obtained in simulations on
conventional hardware as reported in the corresponding papers [Diehl and Cook, 2015, Neftci et al., 2015,
Nessler et al., 2013]. Before we discuss comparison details, let us note that the scope and goals of systems
for neuromorphic hardware are different from those of the deep networks, our neural networks and the
other systems discussed above. A main goal is efficient implementability on neuromorphic hardware,
which is not the focus of deep learning systems. Most bio-inspired systems for neuromorphic hardware
are therefore based on spiking neurons, as such neurons are routinely implemented on neuromorphic chips.
Neither our networks nor any of the other systems we considered above use spiking neurons.

Let us first consider the bio-inspired spiking neural network (SNN) model of Diehl and Cook [2015], 
which consists of an input layer and a single hidden layer of up to 6400 spiking neurons. Results for 
MNIST are obtained after transforming the observed data to Poisson spike trains. After training, a class 
label is assigned to each field by determining their highest average activation per class on the training 
data. Such a training procedure is comparable to the ff-NeSi algorithm, which also learns the first hidden 
layer completely unsupervised and only uses data labels to assign classes to the learned representations in 
a separate training stage. However, instead of using a max-assignment as in Diehl and Cook [2015], we 
use an additional neural layer which approximates a Bayesian classifier (Eqs. T1.6 and T1.8) and learns 
the complete conditional distribution p(c|k, R) = R_kc. When scaled to 6400 neurons, the network of
Diehl and Cook [2015] achieves a 5.0% error rate on MNIST. With our similar (but non-spiking) ff-NeSi 
network of 6400 neurons, we obtained a test error of (3.28 ± 0.04)%. To make the systems still more 
similar, we used the same max assignment of top-layer weights by Diehl and Cook [2015] also for our 
ff-NeSi network, that is, we assigned to each field the single unweighted label which corresponds to the 
class that activated the field most in the training set. When using this hard assignment, we found that 
our test error increased from (3.28 ± 0.04)% to (3.62 ± 0.03)% for the fully labeled case, showing a 
benefit of a probabilistic treatment. If we, as before, only used 100 random class-balanced labeled data
points for the class assignment of fields, we achieved a classification error of (19.97 ± 0.84)% for the
ff-NeSi network with 6400 neurons, and (21.00 ± 0.86)% when we used the hard class assignment of
Diehl and Cook [2015]. Using self-labeling instead, ff+-NeSi achieved a test error of (5.10 ± 0.18)%.
The semi-supervised settings have not been investigated by Diehl and Cook [2015], but such results would
be interesting for comparison, and it should be straightforward to operate the spiking network model
also in this regime.
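The difference between the hard max assignment and a probabilistic class assignment can be illustrated with a small sketch. It is not the NeSi implementation and does not reproduce Eqs. T1.6 and T1.8: it assumes that field activations s_c behave like a posterior over fields, that the class prior is uniform, and that classification uses p(k|y) = Σ_c p(k|c) s_c with p(k|c) = R_kc / Σ_k' R_k'c. All function names are hypothetical.

```python
def class_field_means(acts, labels, n_classes):
    """Mean activation of every field for each class (K x C nested list)."""
    n_fields = len(acts[0])
    sums = [[0.0] * n_fields for _ in range(n_classes)]
    counts = [0] * n_classes
    for s, k in zip(acts, labels):
        counts[k] += 1
        for c, v in enumerate(s):
            sums[k][c] += v
    return [[v / max(counts[k], 1) for v in sums[k]] for k in range(n_classes)]


def normalize_rows(M):
    """Scale each row to sum to one, so that row k reads as p(c|k)."""
    return [[v / (sum(row) or 1.0) for v in row] for row in M]


def hard_labels(means):
    """Hard max: each field gets the class with the highest mean activation."""
    n_classes, n_fields = len(means), len(means[0])
    return [max(range(n_classes), key=lambda k: means[k][c])
            for c in range(n_fields)]


def hard_weights(field_labels, n_classes):
    """One unweighted label per field, normalized per class."""
    return normalize_rows([[1.0 if lab == k else 0.0 for lab in field_labels]
                           for k in range(n_classes)])


def classify(R, s):
    """Return argmax_k of sum_c p(k|c) * s_c, assuming a uniform class prior."""
    n_classes = len(R)
    scores = [0.0] * n_classes
    for c, s_c in enumerate(s):
        col = sum(R[k][c] for k in range(n_classes))
        if col > 0:
            for k in range(n_classes):
                scores[k] += R[k][c] / col * s_c
    return max(range(n_classes), key=scores.__getitem__)
```

With the soft weights `normalize_rows(class_field_means(...))`, a field that responds to several classes contributes to each of them in proportion, whereas `hard_weights` commits such a field entirely to one class.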

The second system we compare to is the Synaptic Sampling Machine (SSM), recently suggested by Neftci 
et al. [2015]. The network consists of 28 x 28 = 784 neurons in the input layer, 500 neurons in the first
hidden layer and 10 neurons in the top layer. Inference and learning are implemented based on spiking
neurons with MNIST data represented by Poisson spike trains. The SSM is closely related to Restricted 
Boltzmann Machines [RBMs; see, for example, Dayan and Abbott, 2001, Salakhutdinov and Hinton, 
2009] with weight changes following a continuous time variant of contrastive divergence [Hinton, 2002]. 
If trained on the MNIST training set using all labels, the SSM achieves an error rate of 4.4% on the test 
set. In addition to the SSM, Neftci et al. [2015] also considered standard discrete time RBMs with the 
same architecture (784-500-10). Using the most conventional setting with Gibbs sampling and standard 
contrastive divergence, the RBM obtained an optimal test error of 5.0% (again using all labels). Learning 
for the RBM was here assumed to stop at the point of optimal performance while longer learning resulted 
in larger test errors which were attributed to decreased MCMC ergodicity and overfitting [Neftci et al., 
2015]. An improved RBM variant, the dSSM network, did not suffer from such overtraining effects. 
The test error of the dSSM (architecture 784-500-10) on fully labeled MNIST was 4.5%. The SSM, 
RBM and dSSM systems have essentially the same network architecture as our NeSi systems if we use 
500 middle layer neurons. For this setting (without further optimization of the remaining free parameters), 
the ff-NeSi network achieved a test error of (4.95 ± 0.03)% and r-NeSi of (3.97 ± 0.03)% (both fully
labeled). When using only 100 labels for training, r+-NeSi achieved a test error of (11.50 ± 1.37)% and
ff+-NeSi of (8.36 ± 0.64)%.

So far, we used the usual training setup, where classes are learned from labeled training data. An 
alternative evaluation procedure is suggested by Nessler et al. [2013] for the final system we compare 
to. The spike-based Expectation Maximization approach (SEM) implements a generative Poisson model
as a spiking neural network with STDP rules. The network is trained unsupervised with 100 neurons on
MNIST. The class assignment of the learned representations is then done directly by the user, who inspects
the fields and assigns to each field what they consider the most likely label. With this procedure, the
network of Nessler et al. [2013] achieves a test error of 19.86%. We can adopt the same procedure by 
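only training the first layer of an ff-NeSi network and then assigning the fields manually with labels
by setting the weights R_kc to

$$R_{kc} = \frac{\delta_{l_c k}}{\sum_{c'} \delta_{l_{c'} k}}, \tag{24}$$

where l_c denotes the label assigned by the user to field c and δ is the Kronecker delta.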


Using this procedure, the NeSi network achieved a test error of (10.53 ± 0.11)%. For functional goals,
we could further improve on these results. We could, for example, ask the user to assign a certainty weight
to the chosen labels or even ask to assign a probability distribution over all possible labels. Improvements
are also possible without requesting further information from the supervisor. By using recurrence and
self-labeling of the r+-NeSi network, we were able to improve classification down to an error rate of
(5.15 ± 0.26)% based on 100 fields labeled by the user (see appendix B.3 for details).

Notably, other lines of research [for example, Esser et al., 2015, Diehl et al., 2015] do not consider
networks for spike-based learning and inference but focus on spike-based inference alone. Typically,
in a first stage, standard (non-spiking) discriminative networks are trained using conventional
back-propagation, and only afterwards, in a second stage, the trained networks are translated to spiking versions
that can be implemented on neuromorphic hardware. As such approaches are, in this sense, not spike-based
learning systems, and because of their fully supervised setting inherited from standard deep learning, they
are not considered here.

A summary of the most relevant comparison results is given in tab. 6. 


5 Discussion 

Deep learning is an important and highly successful research field with approaches filling a spectrum of 
algorithms from purely feedforward and discriminative neural networks to directed generative models. 
Deep discriminative neural networks (DNNs) dominate the field, especially in the prominent domain of 
classification tasks. By deriving the NeSi algorithms from a directed generative model, we have shown 
in this study that inference and learning in a deep directed graphical model can take a very similar form 
as learning in standard DNNs. Furthermore, the derived networks, which we called Neural Simpletrons
(NeSi), improve in our empirical comparison on all standard deep neural networks (like deep belief
networks and CNNs) when only limited amounts of labeled data are available, and they are competitive with
very recent deep learning approaches.

Relation to Standard and Recent Deep Learning. Neural Simpletrons are, on the one hand, similar to 
standard DNNs as they learn online (that is, they learn per data point or per mini-batch), are efficiently 
scalable, and as their activation and learning rules are local, elementary, and neurally plausible (see tab. 1). 
On the other hand, the NeSi networks exhibit features that are a hallmark of deep directed generative

algorithm                      2nd layer neurons   class assignment   test error [%]
------------------------------------------------------------------------------------
SNN [Diehl and Cook, 2015]     6400                hard max           5.0
ff-NeSi                        6400                hard max           3.62 ± 0.03
ff-NeSi                        6400                implicit           3.28 ± 0.04
------------------------------------------------------------------------------------
SSM [Neftci et al., 2015]      500                 implicit           4.4
dSSM [Neftci et al., 2015]     500                 implicit           4.5
RBM [Neftci et al., 2015]      500                 implicit           5.0
ff-NeSi                        500                 implicit           4.95 ± 0.03
r-NeSi                         500                 implicit           3.97 ± 0.03
------------------------------------------------------------------------------------
SEM [Nessler et al., 2013]     100                 user               19.86
ff-NeSi                        100                 user               10.53 ± 0.11

Table 6: Comparison with bio-inspired systems on MNIST. SNN [Diehl and Cook, 2015], SSM, dSSM [Neftci et al., 2015] and
SEM [Nessler et al., 2013] are spiking neural networks, while the NeSi networks are non-spiking. The systems are sorted by the
number of neurons in the 2nd layer (first hidden layer) and the blocks separated by horizontal lines group those systems that
are the most similar to each other. The standard NeSi setup is explicitly changed to facilitate comparison: ‘hard max’ refers to
the class assignment of Diehl and Cook [2015] and ‘implicit’ refers to the (different) standard procedures to assign class labels
for SSMs, RBMs, or the NeSi systems. In both cases learning used all labels of the MNIST training set (note that the ff+- and
r+-NeSi versions are irrelevant for this setting). For class assignment ‘user’, learned fields were hand-labeled and no labels of the
training set were used. All test errors were computed for the MNIST test set. We show the mean and standard error (SEM) based
on 10 runs for the NeSi systems. Other values were taken from the respective publications.


models such as learning from unlabeled data and integration of bottom-up and top-down information 
for optimal inference. By comparing the learning and neural interaction equations of DNNs and the 
NeSi networks directly, Eq. (T1.5) for top-down integration and the learning rules Eqs. (T1.7) and (T1.8) 
represent the crucial differences. The first one allows the NeSi networks to integrate top-down and 
bottom-up information for inference, which contrasts with pure feedforward processing in DNNs. The 
second one shows that NeSi learning is local and Hebbian while approximating likelihood optimization,
which contrasts with less local back-propagation for discriminative learning in standard DNNs. In the
example of the NeSi networks, recurrent bottom-up/top-down integration was especially useful in the fully
labeled case (particularly in the complete setting, see sec. 4.2.4). When we acquire additional inferred
labels through self-labeling, the feedforward system was best at maintaining a low test error even down to
the limit of a single label per class. For fully labeled data, the NeSi systems are no longer competitive,
as seen, for example, on MNIST. Discriminative approaches dominate in this regime, as it seems to be
difficult for such a minimalistic system to compete with discriminative learning once sufficiently many
labeled data points are available. Furthermore, the generative NeSi approach relies on the possibility to
learn representations of meaningful templates (as shown, for example, in figs. 4, 8 and 9), and template
representations make the networks very interpretable. However, for large image databases
showing 2-D images of 3-D objects, for example, learning such templates based on pixel intensities seems very
challenging. For NeSi networks, an additional feature layer (with additional parameters) is likely to
facilitate the learning of template representations. Such a requirement would, however, deviate significantly
from our study of a minimalistic system.

Besides the approaches studied here, many other systems are able to make use of top-down and
bottom-up integration for learning and inference. Top-down information is provided in an indirect way
if a system introduces new labels itself by using its own inference mechanism. Similar to the ff+- and
r+-NeSi networks, this self-labeling idea has been followed repeatedly before [for a recent overview,
see Triguero et al., 2015]. For the NeSi systems, such feedback worked especially well, which may
indicate that self-labeling is particularly promising for deep directed models. Systems that make a more
direct use of bottom-up and top-down information include approaches based on undirected graphical
models. The most prominent examples, especially in the context of deep learning, are deep restricted
Boltzmann machines (RBMs). While RBMs are successfully used in many contexts [for example, Hinton
et al., 2006, Goodfellow et al., 2013, Neftci et al., 2015], performance of RBMs alone, without hybrid
learning approaches, does not seem to be competitive with recent results on semi-supervised learning. The
best performing RBM-related systems we compared to here are the HDRBM [Larochelle and Bengio,
2008] for 20 Newsgroups and the DBN-rNCA system [Salakhutdinov and Hinton, 2007] for MNIST. Both
approaches use additional mechanisms for semi-supervised classification, which can be taken as evidence
for standard RBM approaches being more limited when labeled data is sparse. In this semi-supervised
setting, both ff-NeSi and r-NeSi perform better than the DBN-rNCA approach for MNIST (figs. 6 and 7)
and better than the HDRBM for 20 Newsgroups (tab. 3). When optimized for the fully labeled setting,
NeSi even improves considerably on the HDRBM in the fully labeled 20 Newsgroups task. Recent RBM
versions, enhanced and combined with discriminative deep networks [Goodfellow et al., 2013], outperform
NeSi networks on fully labeled MNIST. However, competitiveness in semi-supervised settings has not
been shown so far. In our empirical evaluations, we also compared to a non-hybrid RBM approach more
directly. When using the very same network architecture (same layer and same neuron numbers), ff-NeSi
and r-NeSi performed better than the RBM for fully labeled MNIST (see comparison to bio-inspired
systems below).

Other approaches that can make use of bottom-up and top-down information are algorithms based on 
other types of directed graphical models. Inference in such approaches is naturally probabilistic, recurrent, 
and of high interest from the functional and biological perspectives [see, for example, Lee and Mumford, 
2003, Haefner et al., 2015]. Regarding the learning and inference equations themselves, the compactness
of the equations defining the NeSi algorithms and their formulation as minimalistic neural networks 
represent a major difference to pure generative approaches [such as Saul et al., 1996, Larochelle and 
Murray, 2011, Gan et al., 2015] or combinations of DNNs and graphical models [for example, Kingma 
et al., 2014]. Regarding empirical comparisons, typical directed generative models are not compared on 
typical DNN tasks but use other evaluation criteria. Prominent or recent examples such as deep SBNs 
[see, for example, Saul et al., 1996, Gan et al., 2015] have, for instance, not been shown to be competitive 
with standard discriminative deep networks on semi-supervised classification tasks, so far. In general, a 
main challenge is the necessity to introduce approximation schemes. The accuracy of approximations for 
large networks, and the complexity of the networks themselves, still seem to prevent scalability and/or 
competitive performance on tasks as discussed here. In principle however, deep directed generative 
models such as deep SBNs or other deep directed multiple-cause approaches are more expressive than 
deep mixture models. We thus interpret our results as highlighting the general potential of deep directed 
generative models also for tasks such as classification. 

Relation to Bio-Inspired Systems. Deep neural networks owe much of their success to their efficient 
implementation on standard hardware such as CPUs and, more so, state-of-the-art GPUs. Another line of 
research focuses on non-standard hardware such as neuromorphic chips. A primary goal in that field is the 
implementability of learning algorithms using spiking neurons, since most neuromorphic developments
use these as elementary building blocks. Many such bio-inspired systems are similar to deep learning
systems or other classifier approaches. The SSM system suggested by Neftci et al. [2015] is, for instance,
closely related to RBMs [DBN; Hinton et al., 2006], and the SEM approach by Nessler et al. [2013] uses
EM to derive the spiking neural network.

In comparison to the NeSi approaches considered in this study, the SSM system and related RBM
approaches [see Neftci et al., 2015] are the most similar bio-inspired approaches in terms of the network
architecture. Furthermore, r-NeSi, SSMs and RBMs are all able to integrate bottom-up and top-down
information for inference. Their respective inference and learning equations are different, however:
SSM and RBMs are based on undirected graphical models, use sampling procedures for inference
and a contrastive divergence variant for learning. In contrast, the NeSi networks are derived from a
directed graphical model and use inference and learning equations as online approximations of exact EM.
Functionally, SSM, RBM and NeSi approaches achieve similar results in the setting investigated by Neftci
et al. [2015] (small network architecture of 784-500-10). The similar performance for the fully labeled
case seems to highlight the common generative model nature of SSMs, RBMs and NeSi approaches. The
r-NeSi approach has, with 3.97%, the lowest error rate compared to SSMs, RBMs and other bio-inspired
systems. For this comparison, 3.97% is a low error if we consider that the SSM is, with 4.4%, the currently
best performing network with spike-based learning. Recurrent inference used by r-NeSi, that is, the ability
to integrate bottom-up and top-down information, is important for such a high performance. Still, the
ff-NeSi system without recurrent inference also achieves 4.95%, which is slightly lower than the error rate of a
standard recurrent RBM.

In contrast to SSM, RBMs and NeSi, the SEM approach by Nessler et al. [2013] and related networks [for 
example, Nessler et al., 2009] use a shallow network architecture (one input and one hidden layer). In 
terms of inference and learning between the neurons, the SEM network is however the system most closely 
related to the NeSi approaches. Both SEM and NeSi use a Poisson noise model for the bottom layer, and
learning is in both cases derived using the EM algorithm. While Poisson noise is a well-suited distribution
for positive observables, it also results in inference equations (E-step) with weighted activity summation
and softmax-like lateral competition among hidden units (due to explaining away effects). Such properties
enable neural network formulations of learning and inference as was shown, for example, by Lücke and
Sahani [2007], Lücke and Sahani [2008], Keck et al. [2012] and Nessler et al. [2013], and for related
distributions, for example, by Deneve et al. [2008] and Nessler et al. [2009]. By combining Poisson noise
of a mixture model with explicit normalization, the network by Keck et al. [2012] and the hierarchical
NeSi networks arrive at very compact rate-based neural update and learning equations. Following the
goal of realizing spike-based networks, Nessler et al. [2013] show how EM based learning for Poisson
mixtures can be approximated using STDP. Functionally, SEMs can learn unsupervised without using any
labels. If the learned fields are assigned to digit classes by the user, test error rates can be computed for 
the SEM [Nessler et al., 2013]. By applying the same procedure to the ff-NeSi network, the SEM and 
NeSi systems can be compared: using 100 first hidden layer units in both networks, SEM achieves a test 
error of 19.86% vs. 10.53% for the ff-NeSi system. 
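Why a Poisson mixture with normalized weights leads to weighted summation followed by softmax-like competition can be checked numerically. The sketch below is an illustration, not the NeSi or SEM implementation: it assumes a mixture with prior p(c) and strictly positive Poisson rates W[c][d] whose rows all sum to the same constant, and the function names are hypothetical. Under this assumption the terms exp(-Σ_d W_cd) and y_d! are identical for all hidden units and cancel in the posterior, which leaves a softmax over I_c = Σ_d y_d log W_cd + log p(c).

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of real numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def poisson_posterior(y, W, prior):
    """Exact posterior p(c|y) of a Poisson mixture, evaluated in log space."""
    log_joint = []
    for w_c, p_c in zip(W, prior):
        # log p(c) + sum_d [ y_d log W_cd - W_cd - log(y_d!) ]
        ll = sum(y_d * math.log(w) - w - math.lgamma(y_d + 1)
                 for y_d, w in zip(y, w_c))
        log_joint.append(math.log(p_c) + ll)
    return softmax(log_joint)


def softmax_activation(y, W, prior):
    """Softmax over weighted input sums I_c = sum_d y_d log W_cd + log p(c)."""
    inputs = [sum(y_d * math.log(w) for y_d, w in zip(y, w_c)) + math.log(p_c)
              for w_c, p_c in zip(W, prior)]
    return softmax(inputs)
```

For weights whose rows share the same sum, both functions return the same distribution, so the hidden-unit activation only needs the weighted input sum and a normalization across units.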

For SNN, SSM, SEM and other spike-based learning networks, performance for scales larger than those 
compared in tab. 6 would be interesting to investigate. In general, such large spiking networks remain 
a challenge, however. On the one hand, the number of neurons and the number of synapses that can be 
implemented on current neuromorphic chips is still relatively limited. On the other hand, simulations 
of spiking networks on standard hardware require the simulation of the neural spiking dynamics, which 
represents a considerable overhead in computational effort. The results for the scalable but non-spiking
NeSi systems may therefore be taken as an indication of the performance that large-scale spiking neural
networks could achieve in general.

Empirical Performance, Model Complexity and Data with Few Labels. The main focus in our study has 
been the semi-supervised task, especially in the limit of few labels. Such a regime should here not be 
considered as a special boundary case. Much to the contrary, with increasing capabilities of state-of-the-art 
sensors and data collected through other sources, large data sets are and will be increasingly easy to obtain. 
Data labels are, on the other hand, costly and often erratic. The limit of few labels is therefore arguably 
the most natural setting for many new applications [also see discussions in Collobert et al., 2006, Kingma
et al., 2014].




Our main results for the NeSi systems were obtained using the 20 Newsgroups, the MNIST and the NIST 
SD19 data sets (with MNIST simply being the data set for which most empirical data for semi-supervised 
learning is available). Tabs. 3 and 4 and figs. 6 and 7 summarize the empirical results and those used for 
comparison. The r-NeSi system is the best performing system for the semi-supervised 20 Newsgroups 
data set, but the data set is much more popular as a fully supervised benchmark (comparison only to 
HDRBM in the semi-supervised setting). More instructive for comparison is therefore the semi-supervised 
MNIST benchmark. As can be observed in fig. 6, the NeSi algorithms achieved smaller test errors than all
standard deep learning and a number of very recent classifier approaches. Even hybrid systems enhanced
for semi-supervised learning such as the DBN-rNCA approach or the EmbedCNN perform less well
than, for example, the r+- and ff+-NeSi algorithms. Only three very recent approaches, M1+M2 [Kingma
et al., 2014], VAT [Miyato et al., 2015], and the Ladder network [Rasmus et al., 2015] show a smaller
test error than the NeSi approaches for data with few labels. However, all of these systems are hybrid
approaches: M1+M2 [Kingma et al., 2014] combines generative and back-prop learning approaches;
the results for the VAT [Miyato et al., 2015] are obtained by combining a DNN using back-prop with a
smoothness constraint derived from the data distribution; and the Ladder network [Rasmus et al., 2015]
applies a per-layer denoising objective to standard discriminative learning models like MLPs and CNNs.
M1+M2 shares the use of generative models with NeSi networks, and both approaches can be
taken as evidence that two hidden layers of generative latents already result in competitive performance.
A difference is, however, the strong reliance of M1+M2 on deep neural networks to parameterize the
dependencies between observed and hidden variables and dependencies among hidden variables, which are
optimized using DNN gradient approaches (the same applies to the DNNs used for the applied variational
approximation). Inference and learning in M1+M2 are therefore significantly more intricate and require
multiple deep networks. Also, the generative description part itself is very different (for instance, motivated
by easy differentiability based on continuous latents) and is in M1+M2 not directly used for inference.
For Neural Simpletrons, the generative and the neural network connections are identical, and are directly
used for inference. Compared to all strongly performing recent approaches [for example, Kingma et al.,
2014, Miyato et al., 2015, Rasmus et al., 2015], the NeSi networks could, therefore, be considered as the
best performing non-hybrid approach in terms of the numerical comparisons in sec. 4.

If we compare the considered (hybrid and non-hybrid) systems in more detail (see fig. 6) a performance 
vs. model complexity trade-off can be observed. If we consider the learning and tuning protocols that 
were used for the different systems to achieve the reported performance, large differences in the number 
of tunable parameters, the size of validation sets and the complexity of the systems can be noticed. While 
some systems only need to tune few parameters, others (especially hybrid systems) require tuning of 
many free parameters (fig. 6). Parameter tuning can be considered as a second optimization loop requiring
labels in addition to those of the training set. These additional labels (usually those of the validation
set) are typically not taken into account if performance in semi-supervised settings is compared. Some
models use up to 10 000 additional labels to tune their free parameters. To (partly) normalize for model 
complexity, performance comparison w.r.t. the total number of required labels could therefore serve as a 
kind of empirical Occam’s Razor. If this total number of labels is considered, the comparison between 
system performance changes as illustrated in fig. 7 (right-hand-side plot). Considering fig. 7, the VAT 
system (1000 additional labels) could be considered to perform more strongly than the Ladder network if 
compared on the total number of labels. The figure also shows that no other system has been shown to 
operate in a regime of as few labels as were used by the NeSi systems. Especially when using self-labeling, 
NeSi networks can be applied to as few as 100 labels for the complete tuning and training procedure, 
where the algorithm achieves less than 5% error on the MNIST test set. While the numbers of tunable 
parameters for the different systems and the sizes of the used validation sets are clearly correlated (fig. 6), 
it remains unclear how many additional labels would really be required by the different systems. The two 
plots of fig. 7 could therefore be considered as two limit cases for comparison. 


Future Work and Outlook. As the NeSi networks share many properties with standard deep neural
networks, further enhancements such as network pruning, annealing or drop-out could be investigated to
further increase performance or efficiency. Any new technique would make the algorithms more complex
and introduce new parameters, however, which would take us further away from our goal of a minimalistic
generative neural network. The same would apply for any additional neural layer. Still, future extensions
could consider more than three layers (more than one middle layer) or layers with multiple separated
softmax functions, which would lead to a committee-like approach. Preliminary experiments with
five separated softmax functions over 10 000 middle layer hidden units each (C = 50 000 in total) with
r+-NeSi on MNIST already showed improvements of the test error, for example, from (4.93 ± 0.05)% to
(4.53 ± 0.07)% when using 100 labels. Also, the combination with discriminative learning approaches is
a promising extension. Ideally, such a combination would maintain a monolithic architecture and a limited
complexity. Other studies have already shown that deep discriminative models can be related to directed
generative models in grounded mathematical ways [see Patel et al., 2016, for a recent example]. Similarly,
complementary discriminative methods could be derived for the NeSi systems. Alternatively, co-training
setups with more loosely coupled discriminative and generative learning can be investigated. Other future
extensions of the NeSi systems may involve generalizations to other types of input. Input layers with other
distributions including Gaussian noise could be investigated (such that also observables with negative
values can be processed). On the other hand, the Poisson distribution is well suited to process data that
signal the presence and absence of features (describing sums of Bernoulli distributed observables). This
would motivate the use of a NeSi approach in combination with dictionaries learned with binary latents
[Lücke and Sahani, 2008, Goodfellow et al., 2012, Mohamed et al., 2012, Sheikh et al., 2014]. Further
research directions would be combinations with hyperparameter optimization approaches [for example,
Thornton et al., 2013, Bergstra et al., 2013, Hutter et al., 2015] in order to increase autonomy and to
exploit the very low number of free parameters. Finally, the probabilistic nature of the NeSi networks
would make it possible to address problems such as label noise in straightforward ways, while its
generative model relation would allow for the investigation of tasks other than classification.


A Derivation Details 


Although the resulting NeSi neural network models exist as a very compact and simple set of equations,
shown in tab. 1, the derivation of these equations is not necessarily trivial. Therefore, we give here
further insight into some derivation steps to allow for a better understanding of the model at hand. In
appendix A.1, we give details on the derivation of the EM update rules for the underlying generative
model. In appendix A.2, we show the necessary derivation steps to attain the approximate equivalence of
neural online learning with EM batch learning at convergence, which is the basis of our neural network
derivation.


A.1 EM Update Steps


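E-Step. The posterior $p(k|c, l, \Theta)$ can be easily obtained by simply applying Bayes' rule for the labeled
and unlabeled case. For $p(c|\vec{y}, l, \Theta)$, however, some additional steps are necessary to attain the compact
form shown in eq. (11).

We start again with Bayes' rule and use the sum and product rule of probability to regain the conditionals
eqs. (1) to (3) of the generative model:

$$ p(c|\vec{y}, l, \Theta) = \frac{p(\vec{y}|c, W) \sum_k p(c|k, \mathcal{R})\, p(k|l)}{\sum_{c'} p(\vec{y}|c', W) \sum_k p(c'|k, \mathcal{R})\, p(k|l)} \tag{25} $$

When we now insert the corresponding distributions eqs. (2) and (3) into eq. (25), the benefit of assuming
Poisson noise for $p(\vec{y}|c, W)$ becomes apparent: First, the factorial given by the $\Gamma$-function directly drops
out. Second, by using the weight constraint eq. (3), the product of exponentials
$\prod_d e^{-W_{cd}} = e^{-\sum_d W_{cd}} = e^{-A}$ also cancels with the denominator:

$$ p(c|\vec{y}, l, \Theta) = \frac{\prod_d \big( (W_{cd})^{y_d}\, \Gamma(y_d+1)^{-1} e^{-W_{cd}} \big) \sum_k R_{kc} u_k}{\sum_{c'} \prod_d \big( (W_{c'd})^{y_d}\, \Gamma(y_d+1)^{-1} e^{-W_{c'd}} \big) \sum_k R_{kc'} u_k} = \frac{\prod_d (W_{cd})^{y_d} \sum_k R_{kc} u_k}{\sum_{c'} \prod_d (W_{c'd})^{y_d} \sum_k R_{kc'} u_k}, \tag{26} $$

$$ u_k = \begin{cases} p(k|l) = \delta_{kl} & \text{for labeled data} \\ p(k) = \frac{1}{K} & \text{for unlabeled data} \end{cases} \tag{27} $$

Here, we used $u_k$ as a shorthand notation to directly cover both the labeled and unlabeled case. We can
now rewrite this result as a softmax function with weighted sums over bottom-up and top-down inputs $y_d$
and $u_k$ as its argument:

$$ p(c|\vec{y}, l, \Theta) = \frac{\exp\big( \sum_d y_d \log(W_{cd}) + \log(\sum_k R_{kc} u_k) \big)}{\sum_{c'} \exp\big( \sum_d y_d \log(W_{c'd}) + \log(\sum_k R_{kc'} u_k) \big)} = \frac{\exp(I_c)}{\sum_{c'} \exp(I_{c'})}, \quad \text{with} \tag{28} $$

$$ I_c = \sum_d \log(W_{cd})\, y_d + \log\Big( \sum_k u_k R_{kc} \Big). \tag{29} $$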
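
The softmax form of eqs. (28) and (29) can be evaluated directly. The following is a minimal NumPy sketch, not the authors' implementation: the function name `posterior_c`, the array shapes and the toy values are assumptions made for illustration only.

```python
import numpy as np

def posterior_c(y, u, W, R):
    """Posterior p(c | y, l, Theta) as a softmax over the inputs I_c (eqs. 28 and 29).

    y: (D,) non-negative observation; u: (K,) label vector u_k of eq. (27);
    W: (C, D) positive weights; R: (K, C) non-negative weights.
    """
    # I_c = sum_d log(W_cd) y_d + log(sum_k u_k R_kc)
    I = np.log(W) @ y + np.log(u @ R)
    I = I - I.max()                      # softmax is shift-invariant; avoids overflow
    s = np.exp(I)
    return s / s.sum()

# Toy example; all sizes and values are illustrative assumptions.
rng = np.random.default_rng(0)
C, D, K, A = 4, 6, 2, 10.0
W = rng.random((C, D)) + 0.1
W *= A / W.sum(axis=1, keepdims=True)    # weight constraint sum_d W_cd = A
R = rng.random((K, C)) + 0.1
R /= R.sum(axis=1, keepdims=True)        # weight constraint sum_c R_kc = 1
y = rng.poisson(2.0, size=D).astype(float)

u_unlabeled = np.full(K, 1.0 / K)        # p(k) = 1/K
u_labeled = np.eye(K)[1]                 # delta_{kl} for the label l = 1

s_unlabeled = posterior_c(y, u_unlabeled, W, R)
s_labeled = posterior_c(y, u_labeled, W, R)
```

Subtracting the maximum of $I_c$ before exponentiating leaves the result unchanged and is only a numerical safeguard. Because the rows of `W` are normalized to `A`, the result agrees with the unsimplified Poisson posterior of eq. (26).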


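M-Step. To maximize the free energy with respect to parameters $W_{cd}$ and $R_{kc}$, we use the method of
Lagrange multipliers for constrained optimization:

$$ \frac{\partial \mathcal{F}}{\partial W_{cd}} + \frac{\partial}{\partial W_{cd}} \sum_{c'} \lambda_{c'} \Big( \sum_{d'} W_{c'd'} - A \Big) = 0, \tag{30} $$

$$ \frac{\partial \mathcal{F}}{\partial R_{kc}} + \frac{\partial}{\partial R_{kc}} \sum_{k'} \lambda_{k'} \Big( \sum_{c'} R_{k'c'} - 1 \Big) = 0. \tag{31} $$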

(31) 


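Starting with the first term of eq. (30) for $W_{cd}$, we insert the free energy eq. (8) and evaluate the partial
derivative:

$$ \frac{\partial}{\partial W_{cd}} \mathcal{F}(\Theta^{old}, \Theta) = \sum_{n, c', k} p(c', k|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) \frac{\partial}{\partial W_{cd}} \sum_{d'} \Big( y_{d'}^{(n)} \log(W_{c'd'}) - W_{c'd'} \Big) = \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) \Big( \frac{y_d^{(n)}}{W_{cd}} - 1 \Big). \tag{32} $$

The second term of eq. (30), incorporating the Lagrange multipliers, results in

$$ \frac{\partial}{\partial W_{cd}} \sum_{c'} \lambda_{c'} \Big( \sum_{d'} W_{c'd'} - A \Big) = \sum_{c'} \sum_{d'} \lambda_{c'} \delta_{cc'} \delta_{dd'} = \lambda_c. \tag{33} $$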

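Both terms put back into eq. (30) and multiplied by $W_{cd}$ yield

$$ \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) \big( y_d^{(n)} - W_{cd} \big) + \lambda_c W_{cd} = 0. \tag{34} $$

To evaluate the Lagrange multipliers $\lambda_c$, we make use of the constraint eq. (3) by taking the sum over $d$:

$$ \lambda_c = \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) - \frac{1}{A} \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) \sum_d y_d^{(n)}. \tag{35} $$

Inserting $\lambda_c$ back into eq. (34) and canceling opposing terms finally yields the update rule for $W_{cd}$:

$$ \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old})\, y_d^{(n)} - W_{cd} \frac{1}{A} \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) \sum_{d'} y_{d'}^{(n)} = 0 \tag{36} $$

$$ \Rightarrow \quad W_{cd} = A\, \frac{\sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old})\, y_d^{(n)}}{\sum_{d'} \sum_n p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old})\, y_{d'}^{(n)}}. \tag{37} $$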


The derivation of the $R_{kc}$ updates follows the same procedure. Evaluation of the two terms in eq. (31) and
multiplication with $R_{kc}$ gives

$$ \sum_n p(c, k|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}) + \lambda_k R_{kc} = 0. \tag{38} $$

Using the constraint eq. (2) for $R_{kc}$, the Lagrange multipliers evaluate to

$$ \lambda_k = - \sum_{n, c} p(c, k|\vec{y}^{(n)}, l^{(n)}, \Theta^{old}). \tag{39} $$

Inserting these back into eq. (38), we arrive at the update rule for $R_{kc}$:

$$ R_{kc} = \frac{\sum_n p(k|c, l^{(n)}, \Theta^{old})\, p(c|\vec{y}^{(n)}, l^{(n)}, \Theta^{old})}{\sum_{c'} \sum_n p(k|c', l^{(n)}, \Theta^{old})\, p(c'|\vec{y}^{(n)}, l^{(n)}, \Theta^{old})}. \tag{40} $$
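
The two update rules, eqs. (37) and (40), amount to normalized weighted sums over the data. The sketch below illustrates this under stated assumptions: the posteriors are passed in as precomputed arrays (here filled with random toy values rather than with results of an actual E-step), and the function name `m_step` and all shapes are hypothetical.

```python
import numpy as np

def m_step(Y, post_c, post_kc, A):
    """One M-step following eqs. (37) and (40).

    Y:       (N, D) data points y^(n)
    post_c:  (N, C) posteriors p(c | y^(n), l^(n), Theta_old)
    post_kc: (N, K, C) products p(k | c, l^(n), Theta_old) * p(c | y^(n), l^(n), Theta_old)
    """
    num_W = post_c.T @ Y                                # sum_n p(c|..) y_d, shape (C, D)
    W = A * num_W / num_W.sum(axis=1, keepdims=True)    # eq. (37)
    num_R = post_kc.sum(axis=0)                         # sum_n p(k|c,..) p(c|..), shape (K, C)
    R = num_R / num_R.sum(axis=1, keepdims=True)        # eq. (40)
    return W, R

# Toy posteriors; random values for illustration only (assumption).
rng = np.random.default_rng(0)
N, D, C, K, A = 50, 5, 3, 2, 20.0
Y = rng.poisson(3.0, size=(N, D)).astype(float)
post_c = rng.random((N, C))
post_c /= post_c.sum(axis=1, keepdims=True)
pk_given_c = rng.random((N, K, C))
pk_given_c /= pk_given_c.sum(axis=1, keepdims=True)
post_kc = pk_given_c * post_c[:, None, :]

W_new, R_new = m_step(Y, post_c, post_kc, A)
```

By construction, the returned weights satisfy the constraints $\sum_d W_{cd} = A$ and $\sum_c R_{kc} = 1$, and `W_new` fulfills the stationarity condition of eq. (36).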


A.2 Approximate Equivalence of Neural Online Learning at Convergence 


In more detail, to derive eqs. (19), we first consider the dynamic behavior of the summed weights
$\overline{W}_c = \sum_d W_{cd}$ and $\overline{R}_k = \sum_c R_{kc}$. By taking sums over $d$ and $c$ for eqs. (16) and (17),
respectively, we obtain

$$ \Delta \overline{W}_c = \epsilon_W\, s_c\, (A - \overline{W}_c), \qquad \Delta \overline{R}_k = \epsilon_R\, t_k\, (B - \overline{R}_k). \tag{41} $$

As we assume $s_c, t_k > 0$, we find that for small learning rates $\epsilon_W$ and $\epsilon_R$, the states $\overline{W}_c = A$ and
$\overline{R}_k = B$ are stable (and the only) fixed points of the dynamics for $\overline{W}_c$ and $\overline{R}_k$. This applies for all $k$ and $c$
and for any $s_c$ and $t_k$ that are non-negative and continuous w.r.t. their arguments.
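
The fixed-point behavior of eq. (41) can be illustrated numerically. The following sketch iterates the dynamics of a single summed weight for a sequence of positive activations; the starting value, the learning rate and the random activations are assumptions chosen for illustration.

```python
import numpy as np

def summed_weight_trajectory(w_bar0, A, eps, activations):
    """Iterate Delta W_bar = eps * s * (A - W_bar), eq. (41), for a sequence of activations s."""
    w_bar, out = w_bar0, [w_bar0]
    for s in activations:
        w_bar = w_bar + eps * s * (A - w_bar)
        out.append(w_bar)
    return np.array(out)

rng = np.random.default_rng(1)
A, eps_W = 10.0, 0.05                              # illustrative values (assumptions)
activations = rng.uniform(0.1, 1.0, size=2000)     # arbitrary positive activations s_c
trajectory = summed_weight_trajectory(3.0, A, eps_W, activations)
```

Each step shrinks the distance to $A$ by the factor $(1 - \epsilon_W s_c)$, so for $0 < \epsilon_W s_c < 1$ the summed weight approaches $A$ monotonically, whatever the individual activations are.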


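The above result uses an approach developed by Keck et al. [2012], which we apply here to a hierarchical
system with two hidden layers instead of one, while also taking label information into account. By assuming
normalized weights based on eqs. (41), we can approximate the effect of iteratively applying eqs. (16)
and (17) as

$$ W_{cd}^{(n+1)} = A\, \frac{W_{cd}^{(n)} + \epsilon_W\, s_c(\vec{y}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, y_d^{(n)}}{\sum_{d'} \big( W_{cd'}^{(n)} + \epsilon_W\, s_c(\vec{y}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, y_{d'}^{(n)} \big)} \tag{42} $$

and

$$ R_{kc}^{(n+1)} = B\, \frac{R_{kc}^{(n)} + \epsilon_R\, t_k(\vec{s}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, s_c^{(n)}}{\sum_{c'} \big( R_{kc'}^{(n)} + \epsilon_R\, t_k(\vec{s}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, s_{c'}^{(n)} \big)}, \tag{43} $$

where $W_{cd}^{(n)}$ and $R_{kc}^{(n)}$ denote the weights at the $n$th iteration of learning, where $\Theta^{(n)} = (W^{(n)}, \mathcal{R}^{(n)})$, and
where $\vec{s}^{(n)} = \vec{s}(\vec{y}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})$ to abbreviate notation. Both equations can be further simplified. Using
the abbreviations $F_{cd}^{(n)} = s_c(\vec{y}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, y_d^{(n)}$ and $G_{kc}^{(n)} = t_k(\vec{s}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})\, s_c(\vec{y}^{(n)}, \vec{u}^{(n)}, \Theta^{(n)})$,
we first rewrite eqs. (42) and (43) as

$$ W_{cd}^{(n+1)} = A\, \frac{W_{cd}^{(n)} + \epsilon_W F_{cd}^{(n)}}{\sum_{d'} \big( W_{cd'}^{(n)} + \epsilon_W F_{cd'}^{(n)} \big)} \quad \text{and} \quad R_{kc}^{(n+1)} = B\, \frac{R_{kc}^{(n)} + \epsilon_R G_{kc}^{(n)}}{\sum_{c'} \big( R_{kc'}^{(n)} + \epsilon_R G_{kc'}^{(n)} \big)}. \tag{44} $$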

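Let us suppose that learning has converged after about $T$ iterations. If we now add another $N$ iterations
and repeatedly apply the learning steps, closed-form expressions for the weights $W_{cd}^{(T+N)}$ and $R_{kc}^{(T+N)}$
are given by

$$ W_{cd}^{(T+N)} = \frac{W_{cd}^{(T)} + \epsilon_W \sum_{n=1}^{N} F_{cd}^{(T+N-n)} \prod_{n'=n+1}^{N} \big( 1 + \frac{\epsilon_W}{A} \sum_{d'} F_{cd'}^{(T+N-n')} \big)}{\prod_{n'=1}^{N} \big( 1 + \frac{\epsilon_W}{A} \sum_{d'} F_{cd'}^{(T+N-n')} \big)} \tag{45} $$

and

$$ R_{kc}^{(T+N)} = \frac{R_{kc}^{(T)} + \epsilon_R \sum_{n=1}^{N} G_{kc}^{(T+N-n)} \prod_{n'=n+1}^{N} \big( 1 + \frac{\epsilon_R}{B} \sum_{c'} G_{kc'}^{(T+N-n')} \big)}{\prod_{n'=1}^{N} \big( 1 + \frac{\epsilon_R}{B} \sum_{c'} G_{kc'}^{(T+N-n')} \big)}. \tag{46} $$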


The large products in numerator and denominator of eqs. (45) and (46) can be regarded as polynomials
of order $N$ for $\epsilon_W$ and $\epsilon_R$, respectively. Even for small $\epsilon_W$ and $\epsilon_R$ it is difficult, however, to argue that
higher-order terms of $\epsilon_W$ and $\epsilon_R$ can be neglected because of the combinatorial growth of prefactors given
by the large products.

We therefore consider the approximations derived for the non-hierarchical model in Keck et al. [2012],
which were applied to an equation of the same structure as eqs. (45) and (46). At closer inspection of the
terms $F_{cd}^{(n)}$ and $G_{kc}^{(n)}$ we find that we can apply these approximations also for the hierarchical
case. For completeness, we reiterate the main intermediate steps of these approximations below:


Taking eq. (45) as example, we simplify its right-hand-side. The approximations are all assuming a small 
but finite learning rate ew and a large number of inputs N. Eq. (45) is then approximated by 


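$$ W_{cd}^{(T+N)} \approx \frac{W_{cd}^{(T)} + \epsilon_W \sum_{n=1}^{N} \exp\big( \frac{\epsilon_W}{A} \sum_{n'=n+1}^{N} \sum_{d'} F_{cd'}^{(T+N-n')} \big)\, F_{cd}^{(T+N-n)}}{\exp\big( \frac{\epsilon_W}{A} \sum_{n'=1}^{N} \sum_{d'} F_{cd'}^{(T+N-n')} \big)} \tag{47} $$

$$ \approx \epsilon_W\, \hat{F}_{cd}^{(0)} \sum_{n=1}^{N} \exp\Big( - \frac{\epsilon_W}{A}\, n \sum_{d'} \hat{F}_{cd'}^{(0)} \Big) \tag{48} $$

$$ \approx \epsilon_W\, \hat{F}_{cd}^{(0)}\, \frac{\exp\big( - \frac{\epsilon_W}{A} \sum_{d'} \hat{F}_{cd'}^{(0)} \big)}{1 - \exp\big( - \frac{\epsilon_W}{A} \sum_{d'} \hat{F}_{cd'}^{(0)} \big)} \approx A\, \frac{\hat{F}_{cd}^{(0)}}{\sum_{d'} \hat{F}_{cd'}^{(0)}} = A\, \frac{\sum_{n=1}^{N} F_{cd}^{(T+N-n)}}{\sum_{d'} \sum_{n=1}^{N} F_{cd'}^{(T+N-n)}}, \tag{49} $$

where $\hat{F}_{cd}^{(n)} = \frac{1}{N-n} \sum_{n'=n+1}^{N} F_{cd}^{(T+N-n')}$ (note that $\hat{F}_{cd}^{(0)}$ is the mean of $F_{cd}^{(n)}$ over $N$ iterations starting
at iteration $T$). In (48), the contribution of $W_{cd}^{(T)}$ is suppressed by the factor
$\exp\big( - \frac{\epsilon_W}{A} N \sum_{d'} \hat{F}_{cd'}^{(0)} \big)$ and has been neglected for large $N$.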


For the first step (47) we rewrote the products in eq. (45) and used a Taylor expansion [for details, see
Supplement of Keck et al., 2012]:

$$ \prod_{n'=n+1}^{N} \Big( 1 + \frac{\epsilon_W}{A} \sum_{d'} F_{cd'}^{(T+N-n')} \Big) \approx \exp\Big( \frac{\epsilon_W}{A} \sum_{n'=n+1}^{N} \sum_{d'} F_{cd'}^{(T+N-n')} \Big). \tag{50} $$

For the second step (48) we approximated the sum over $n$ in eq. (47) by observing that the terms with
large $n$ are negligible, and by approximating sums of $F_{cd}^{(T+N-n)}$ over $n$ by the mean $\hat{F}_{cd}^{(0)}$. For the last
steps, eq. (49), we used the geometric series and approximated for large $N$ [for details on these last two
approximations, see again Supplement of Keck et al., 2012]. Furthermore, we used the fact that for small
$\epsilon_W$, $\frac{\epsilon_W \exp(-\epsilon_W B)}{1 - \exp(-\epsilon_W B)} \approx B^{-1}$ (which can be seen, for example, by applying l'Hôpital's rule).


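By inserting the definition of $F_{cd}^{(n)}$ into (49) (and shifting the summation index, which is immaterial at
convergence) we finally find:

$$ W_{cd}^{(T+N)} \approx A\, \frac{\sum_{n=1}^{N} s_c(\vec{y}^{(T+n)}, \vec{u}^{(T+n)}, \Theta^{(T+n)})\, y_d^{(T+n)}}{\sum_{d'} \sum_{n=1}^{N} s_c(\vec{y}^{(T+n)}, \vec{u}^{(T+n)}, \Theta^{(T+n)})\, y_{d'}^{(T+n)}}. \tag{51} $$

Analogously, we find for $R_{kc}$:

$$ R_{kc}^{(T+N)} \approx B\, \frac{\sum_{n=1}^{N} t_k(\vec{s}^{(T+n)}, \vec{u}^{(T+n)}, \Theta^{(T+n)})\, s_c^{(T+n)}}{\sum_{c'} \sum_{n=1}^{N} t_k(\vec{s}^{(T+n)}, \vec{u}^{(T+n)}, \Theta^{(T+n)})\, s_{c'}^{(T+n)}}, \tag{52} $$

where we again used $\vec{s}^{(T+n)} = \vec{s}(\vec{y}^{(T+n)}, \vec{u}^{(T+n)}, \Theta^{(T+n)})$ for better readability in the last equation. If we now
assume convergence, we can replace $W_{cd}^{(T+N)}$ and $W_{cd}^{(T+n)}$ by $W_{cd}$, and $R_{kc}^{(T+N)}$ and $R_{kc}^{(T+n)}$ by $R_{kc}$, to
recover eqs. (19) in sec. 3 with converged weights $W_{cd}$ and $R_{kc}$.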


Note that each approximation is individually very accurate for small $\epsilon_W$ and large $N$. Eqs. (19) can
thus be expected to be satisfied with high accuracy in this case, and numerical experiments based on
comparisons with EM batch-mode learning verified such high precision.
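
The approximate equivalence can be checked on a toy sequence. The sketch below iterates the normalized online update for the weights of a single unit and compares the result with the batch expression on the right-hand side of eq. (49). It assumes a prescribed stationary sequence $F^{(n)}$ (in the network, $F^{(n)}$ would itself depend on the weights through $s_c$); all names and numerical values are illustrative.

```python
import numpy as np

def online_normalized_learning(F, W0, A, eps):
    """Iterate W <- A (W + eps F^(n)) / sum_d' (W + eps F^(n)) for one unit c."""
    W = W0.copy()
    for F_n in F:
        W = W + eps * F_n
        W *= A / W.sum()
    return W

rng = np.random.default_rng(0)
D, N, A, eps = 5, 5000, 5.0, 0.01
base = np.array([0.2, 0.5, 1.0, 1.3, 2.0])        # hypothetical mean of F_cd over inputs
F = base * rng.uniform(0.9, 1.1, size=(N, D))     # stationary toy sequence of F_cd
W0 = np.full(D, A / D)                            # normalized start, sum_d W0_d = A

W_online = online_normalized_learning(F, W0, A, eps)
W_batch = A * F.sum(axis=0) / F.sum()             # A * sum_n F_cd / sum_d' sum_n F_cd'
```

For a small learning rate and many iterations, `W_online` fluctuates closely around `W_batch`; the residual deviation shrinks as `eps` is reduced, in line with the approximations above.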


B Computational Details 

B.1 Parallelization on GPUs and CPUs


The online update rules of the neural network (tab. 1) are ideally suited for parallelization using GPUs,
as they break down to elementary vector or matrix multiplications. We observed GPU executions with
Theano to result in training time speed-ups of over two orders of magnitude compared to single-CPU 
execution (NVIDIA GeForce GTX TITAN Black GPUs vs. AMD Opteron 6134 CPUs). 


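Furthermore, we can use the concept of mini-batch training for CPU parallelization or to optimize GPU
memory usage. There, the learning effect of a small number of consecutive updates in eqs. (16) and (17)
is approximated by one parallelized update over $\nu$ independent updates:

$$ W_{cd}^{(n+\nu)} \approx W_{cd}^{(n)} + \sum_{i=0}^{\nu-1} \Delta W_{cd}\big( \vec{y}^{(n+i)}, \vec{u}^{(n+i)}, \Theta^{(n)} \big), \tag{53} $$

$$ R_{kc}^{(n+\nu)} \approx R_{kc}^{(n)} + \sum_{i=0}^{\nu-1} \Delta R_{kc}\big( \vec{y}^{(n+i)}, \vec{u}^{(n+i)}, \Theta^{(n)} \big), \tag{54} $$

where $\Delta W_{cd}$ and $\Delta R_{kc}$ denote the single-step updates of eqs. (16) and (17), all evaluated with the same
weights $\Theta^{(n)}$.

The maximal aberration from single-step updates caused by this approximation can be shown to be of
$O((\epsilon\nu)^2)$. Since this effect is negligible for $\epsilon\nu \ll 1$, as experimentally confirmed in tab. B.1, we only
consider the mini-batch size as a parallelization parameter, and not as a free parameter that could be
chosen to optimize training in anything other than training speed.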


mini-batch size            1                   10                  100

mean test error [%]        23.1 ± 0.2          23.2 ± 0.2          23.0 ± 0.2
std. dev. σ_err [pp]       1.27                1.26                1.24
mean log-likelihood        -836.472 ± 0.005    -836.468 ± 0.005    -836.475 ± 0.005
std. dev. σ_ll             0.034               0.032               0.036


Table B.1: Results are shown as average over 50 training runs on a small network of C = 30 hidden units using N = 3000
training data points of the MNIST data set. The mini-batch size shows no significant influence on either the mean or the
variance of the test error or likelihood of the converged solutions.
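
The size of the mini-batch approximation error can be illustrated on a toy problem. The sketch below compares $\nu$ sequential single-step updates of the first-layer weights with one mini-batch update in which all activations use the same weights. It rests on two assumptions that are not restated in this appendix: the single-step rule is taken in the form $\Delta W_{cd} = \epsilon (s_c y_d - s_c W_{cd})$, which is consistent with eq. (41), and the activation is a plain feed-forward softmax without top-down input. All function names and values are hypothetical.

```python
import numpy as np

def activations(y, W):
    """Feed-forward softmax activation s_c for one input (no top-down input, for brevity)."""
    I = np.log(W) @ y
    s = np.exp(I - I.max())
    return s / s.sum()

def sequential_updates(Y, W, eps):
    """Apply the assumed online rule Delta W_cd = eps (s_c y_d - s_c W_cd) once per input."""
    W = W.copy()
    for y in Y:
        s = activations(y, W)
        W += eps * s[:, None] * (y[None, :] - W)
    return W

def minibatch_update(Y, W, eps):
    """One parallel update in which all activations use the same weights W^(n)."""
    S = np.stack([activations(y, W) for y in Y])          # shape (nu, C)
    return W + eps * (S.T @ Y - S.sum(axis=0)[:, None] * W)

def deviation(eps, nu=10, C=3, D=8, seed=0):
    """Maximal absolute difference between sequential and mini-batch weights."""
    rng = np.random.default_rng(seed)
    W = 1.0 + 0.1 * rng.random((C, D))
    W *= D / W.sum(axis=1, keepdims=True)                 # normalize rows to A = D
    Y = rng.random((nu, D)) + 0.1
    Y *= D / Y.sum(axis=1, keepdims=True)                 # normalize inputs to A = D
    return np.abs(sequential_updates(Y, W, eps) - minibatch_update(Y, W, eps)).max()

dev_large = deviation(eps=1e-3)    # eps * nu = 1e-2
dev_small = deviation(eps=1e-4)    # eps * nu = 1e-3
```

Both procedures agree to first order in the learning rate, so reducing `eps` by a factor of ten should reduce the deviation by roughly a factor of one hundred, which is the quadratic scaling stated above.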


B.2 Weight Initialization 

For the complete setting (C = K), where there is a good amount of labeled data per hidden unit even
when labeled data is sparse, and where the risk of running into early local optima in which the classes are
not well separated is high, we initialize the weights of the first hidden layer in a modified version of Keck et al.
[2012]: We compute the mean $m_{kd}$ and standard deviation $\sigma_{kd}$ of the labeled training data for each class
$k$ and set $W_{kd} = m_{kd} + \mathcal{U}(0, 2\sigma_{kd})$, where $\mathcal{U}(x_{dn}, x_{up})$ denotes the uniform distribution in the range
$(x_{dn}, x_{up})$.

For the overcomplete setting (C > K), where there are far fewer labeled data points than hidden units in
the semi-supervised setting, and class separation is no imminent problem, we initialize the weights using
all data disregarding the label information. With the mean $m_d$ and standard deviation $\sigma_d$ over all training
data points we set $W_{cd} = m_d + \mathcal{U}(0, 2\sigma_d)$.


The weights of the second hidden layer are initialized as $R_{kc} = 1/C$. The only exception to this rule are
the additional experiments on the 20 Newsgroups data set in sec. 4.2.4 for the fully labeled setting. As
noted in the text, in this setting we were able to make better use of the recurrent connections of the r-NeSi
network and the fully labeled data set by initializing the weights of the second hidden layer as $R_{kc} = \delta_{kc}$.
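
The initialization schemes described above can be sketched as follows. This is an illustration under assumptions: the function names and the toy data are hypothetical, and the sketch neither enforces positivity of the weights nor the constraint $\sum_d W_{cd} = A$.

```python
import numpy as np

def init_complete(X_lab, labels, K, rng):
    """C = K: W_kd = m_kd + U(0, 2 sigma_kd), computed from the labeled data of class k."""
    W = np.empty((K, X_lab.shape[1]))
    for k in range(K):
        X_k = X_lab[labels == k]                 # needs at least one labeled point per class
        m, sigma = X_k.mean(axis=0), X_k.std(axis=0)
        W[k] = m + rng.uniform(0.0, 1.0, size=m.shape) * 2.0 * sigma
    return W

def init_overcomplete(X_all, C, rng):
    """C > K: W_cd = m_d + U(0, 2 sigma_d), computed from all data, ignoring labels."""
    m, sigma = X_all.mean(axis=0), X_all.std(axis=0)
    return m + rng.uniform(0.0, 1.0, size=(C, m.size)) * 2.0 * sigma

def init_second_layer(K, C):
    """R_kc = 1/C for all k and c."""
    return np.full((K, C), 1.0 / C)

# Toy data; random values for illustration only (assumption).
rng = np.random.default_rng(0)
N, D, K, C = 40, 6, 4, 12
X = rng.random((N, D))
labels = np.arange(N) % K

W_complete = init_complete(X, labels, K, rng)
W_over = init_overcomplete(X, C, rng)
R0 = init_second_layer(K, C)
```

Scaling a draw from the unit interval by $2\sigma$ yields a sample that is uniform between 0 and $2\sigma$, so every initial weight lies between the mean and the mean plus two standard deviations.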


B.3 Interactive Labeling with r+-NeSi 


In sec. 4.5, we trained only the first layer of an ff-NeSi network of 100 hidden units and assigned
class labels to the learned representations afterwards by hand. This already achieved a test error of
(10.53 ± 0.11)%. To further improve on these results, we can use both the recurrence and the self-labeling
of r+-NeSi to our advantage:


First, we can train the second hidden layer also on unlabeled data by setting $t_k = 1/K$. This assumes that
all labels are equally likely for all input data points. This way the network learns the distribution $p(c|\mathcal{R})$
from the input data and can use this information in the recurrent connections. We can then assign the
classes to fields by weighting the labels with this learned distribution $p(c|\mathcal{R}) \propto \sum_k R_{kc}$:

$$ R_{kc} = \frac{\delta_{l(c)k} \sum_{k'} R^{old}_{k'c}}{\sum_{c'} \delta_{l(c')k} \sum_{k'} R^{old}_{k'c'}}, \tag{55} $$

where $l(c)$ denotes the class label assigned by hand to field $c$. This way, we do not change the learned
information $p(c|\mathcal{R})$ but only set the conditional $p(k|c, \mathcal{R}) = R_{kc} / \sum_{k'} R_{k'c}$ to $\delta_{l(c)k}$. Using this
distribution for class inference can already significantly decrease the test
error, as can be seen by comparison of the ‘hard max’ assignment in tab. 6 to the learned complete
distribution (‘implicit’).
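
The reassignment of eq. (55) can be sketched in a few lines. The function name `assign_field_labels`, the toy weights and the hand-assigned labels below are assumptions for illustration; the sketch also assumes that every class is assigned to at least one field.

```python
import numpy as np

def assign_field_labels(R_old, field_label):
    """Eq. (55): keep the learned mass of each field c, but attach it to the class l(c) only.

    R_old: (K, C) learned weights; field_label: (C,) hand-assigned class l(c) of each field.
    """
    K, C = R_old.shape
    mass = R_old.sum(axis=0)                          # sum_k' R_old[k', c], shape (C,)
    delta = np.eye(K)[:, field_label]                 # delta[k, c] = 1 if l(c) == k, else 0
    R_new = delta * mass                              # numerator of eq. (55)
    return R_new / R_new.sum(axis=1, keepdims=True)   # normalize over c for each class k

# Toy example; random weights and arbitrary labels (assumptions).
rng = np.random.default_rng(0)
K, C = 3, 6
R_old = rng.random((K, C))
R_old /= R_old.sum(axis=1, keepdims=True)
field_label = np.array([0, 1, 2, 0, 1, 2])

R_new = assign_field_labels(R_old, field_label)
```

After the reassignment, each column of `R_new` has a single non-zero entry, so the conditional $p(k|c, \mathcal{R})$ is one-hot at the assigned class, while each row still sums to one.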


Second, instead of training a network of 100 hidden middle layer units, we can again train a much bigger 
network of 10 000 hidden units, but still only label 100 of the learned fields. During further training with 
self-labeling, the classes of these few labeled fields will also provide class information for the remaining 
99% of unlabeled fields to learn their associated classes. While there may be many more informative 
ways to pick the 100 fields that the supervisor has to label, we here simply chose those fields at random. 
Over 10 repetitions, we achieved a test error of (5.15 ± 0.26)%, which is comparable to the results when 
training on 100 random labels in the training set. 


C Detailed Training Results 


We performed 100 independent training runs for results obtained on MNIST and 20 Newsgroups in
secs. 4.2.3 and 4.3.3, and 10 independent training runs for the NIST data set in sec. 4.4, with each of the
given networks for each label setting and with new randomly chosen, class-balanced labels for each training
run. Tabs. C.1 to C.10 give a detailed summary of the statistics of the obtained results. They show the
mean test error alongside the standard error of the mean (SEM), the standard deviation (in pp.), as well as
the minimal and maximal test error in the given number of runs. For the networks with self-labeling of
unlabeled data (ff+- and r+-NeSi) we only show the semi-supervised settings, as they are identical to their
respective standard versions in the fully labeled case.
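
The reported statistics can be computed from the per-run test errors as sketched below. The input values are made up for illustration and are not results from the paper; the use of the sample standard deviation (`ddof=1`) is an assumption, as the convention is not stated.

```python
import numpy as np

def summarize_runs(test_errors):
    """Mean, SEM, standard deviation, minimum and maximum of per-run test errors (in %)."""
    e = np.asarray(test_errors, dtype=float)
    std = e.std(ddof=1)                       # sample standard deviation (assumed convention)
    return {"mean": e.mean(), "sem": std / np.sqrt(e.size), "std": std,
            "min": e.min(), "max": e.max()}

summary = summarize_runs([4.0, 5.0, 6.0])     # made-up numbers, not results from the paper
```

The SEM is the standard deviation divided by the square root of the number of runs. For example, a standard deviation of 0.49 pp over 100 runs corresponds to an SEM of about 0.05, as listed for r+-NeSi on MNIST with 100 labels.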


D Tunable Parameters of the Compared Algorithms 

We list in tab. D.1 the tunable parameters of each method compared in figs. 6 and 7. For some of the
methods, this estimate only gives a lower bound on the number of tunable parameters, as some of their
parameters may have multiple instances, for example, one for each added layer in the network. If a parameter
was kept constant for all layers, we only counted it as a single parameter, whereas parameters that
had differing values in different layers were counted as multiple parameters. An example is the constant
number of hidden units in ‘NN’ versus the differing numbers in the layers of the ‘CNN’. We also counted
parameters that were not (explicitly) optimized in the corresponding papers themselves, but were taken
from other papers (for example, parameters of the ADAM algorithm), or where the reason for the specific
choice is not given (like for specific network architectures).


#labels   mean test error   std. dev.   min.    max.

     20   70.64 ± 0.68      6.82        55.35   88.59
     40   55.67 ± 0.54      5.44        37.53   68.13
    200   30.59 ± 0.22      2.22        26.97   37.57
    800   28.26 ± 0.10      1.00        26.68   31.59
   2000   27.87 ± 0.07      0.74        25.85   30.01
  11269   28.08 ± 0.08      0.78        26.29   30.25

Table C.1: ff-NeSi on 20 Newsgroups.


#labels   mean test error   std. dev.   min.    max.

     20   68.68 ± 0.77      7.72        49.98   85.48
     40   54.24 ± 0.66      6.59        37.00   66.76
    200   29.28 ± 0.21      2.09        25.90   39.60
    800   27.20 ± 0.07      0.70        25.85   29.41
   2000   27.15 ± 0.07      0.65        25.77   29.13
  11269   27.28 ± 0.07      0.73        26.08   29.82

Table C.2: r-NeSi on 20 Newsgroups.


#labels   mean test error   std. dev.   min.    max.

     10   55.46 ± 0.57      5.72        42.49   69.62
    100   19.08 ± 0.26      2.61        13.31   24.93
    600    7.27 ± 0.05      0.49         6.01    8.76
   1000    5.88 ± 0.03      0.31         5.19    6.97
   3000    4.39 ± 0.02      0.15         4.01    4.89
  60000    3.27 ± 0.01      0.08         3.08    3.46

Table C.3: ff-NeSi on MNIST.


#labels   mean test error   std. dev.   min.    max.

     10   29.61 ± 0.57      5.71        20.05   46.05
    100   12.43 ± 0.15      1.53         9.29   16.25
    600    6.94 ± 0.05      0.49         5.72    8.44
   1000    6.07 ± 0.03      0.28         5.24    6.78
   3000    4.68 ± 0.02      0.19         4.22    5.29
  60000    2.94 ± 0.01      0.08         2.75    3.14

Table C.4: r-NeSi on MNIST.


#labels   mean test error   std. dev.   min.    max.

     10   10.91 ± 0.86      8.64         3.96   53.15
    100    4.96 ± 0.08      0.82         3.84    9.13
    600    4.08 ± 0.02      0.17         3.68    4.73
   1000    4.00 ± 0.01      0.12         3.76    4.38
   3000    3.85 ± 0.01      0.11         3.64    4.14

Table C.5: ff+-NeSi on MNIST.


#labels   mean test error   std. dev.   min.    max.

     10   18.68 ± 0.89      8.90         5.06   51.88
    100    4.93 ± 0.05      0.49         4.26    7.32
    600    4.34 ± 0.01      0.15         3.87    4.78
   1000    4.26 ± 0.01      0.12         3.97    4.62
   3000    4.05 ± 0.01      0.10         3.84    4.29

Table C.6: r+-NeSi on MNIST.


#labels   mean test error   std. dev.   min.    max.

     10    7.56 ± 1.76      5.67         5.52   23.46
    100    6.20 ± 0.16      0.51         5.49    7.08
    600    6.02 ± 0.08      0.25         5.72    6.51
   1000    6.02 ± 0.12      0.38         5.63    6.99
   3000    5.70 ± 0.03      0.10         5.56    5.89
 344307    5.11 ± 0.01      0.03         5.06    5.16

Table C.7: ff-NeSi on NIST digits.


#labels   mean test error   std. dev.   min.    max.

     10    9.84 ± 2.41      7.61         5.64   34.95
    100    6.14 ± 0.23      0.72         5.52    7.84
    600    5.83 ± 0.14      0.45         5.43    6.50
   1000    5.94 ± 0.12      0.39         5.46    6.49
   3000    5.72 ± 0.10      0.33         5.52    6.63
 344307    4.52 ± 0.01      0.04         4.44    4.56

Table C.8: r+-NeSi on NIST digits.


#labels   mean test error   std. dev.   min.    max.

     52   55.70 ± 0.62      1.96        52.88   58.75
    520   46.22 ± 0.43      1.37        43.91   48.47
   3120   44.24 ± 0.23      0.74        43.23   45.49
   5200   43.69 ± 0.21      0.65        42.53   44.40
  15600   42.96 ± 0.28      0.88        41.55   44.38
 387361   34.66 ± 0.05      0.15        34.45   34.86

Table C.9: ff-NeSi on NIST letters.


#labels   mean test error   std. dev.   min.    max.

     52   64.97 ± 0.85      2.70        60.88   69.71
    520   54.08 ± 0.38      1.21        51.71   55.89
   3120   43.73 ± 0.15      0.47        42.99   44.62
   5200   41.57 ± 0.13      0.42        40.90   42.21
  15600   37.95 ± 0.12      0.38        37.25   38.56
 387361   31.93 ± 0.06      0.18        31.63   32.17

Table C.10: r+-NeSi on NIST letters.


Method        Tunable Hyper-Parameters                                                       Total

SVM           C (soft margin parameter)                                                      1

TSVM          C (soft margin parameter), λ (data-similarity kernel parameter)                2

NN            number of hidden layers (here: 1-15), number of hidden units (same per         3
              layer), learning rate

AGR           m, s, γ, dimensionality reduction (for acceleration)                           3-4

NeSi (ours)   C (number of middle layer units), A (input normalization), learning rates      4-5
              ε_W and ε_R, BvSB threshold ϑ (only for r+- and ff+-NeSi)

AtlasRBF      λ, γ, ω, number of neighbors, local manifold dimensionality                    5

Em^{all}NN    NN hyper-parameters, number of layers to embed (here: all), λ, m (distance     6
              parameter)

CNN           number of CNN layers (here: 6), patch size, pooling window size (2nd layer),   ≥ 9
              neighborhood radius (4th layer), 5th and 6th layer units, learning rate

M1+M2         M1: number of hidden layers (here: 2), number of hidden units per layer,       ≥ 10
              number of samples from posterior; M2: number of hidden layers (here:
              1), number of hidden units, α; RMSProp: learning rate, first and second
              momenta

DBN-rNCA      number of layers (here: 4), number of hidden units per layer; RBM learning     ≥ 11
              rate, momentum, weight-decay, RBM epochs, NCA epochs, λ (tradeoff
              parameter)

EmCNN         CNN hyper-parameters, number of layers to embed, λ, m (distance parameter)     ≥ 12

VAT           number of layers (here: 2-4), number of hidden units per layer, λ, ε, I_p;     ≥ 12
              ADAM [Kingma and Ba, 2015]: learning rate α, ε_ADAM, exponential decay
              rates β_1 and β_2; batch normalization [Ioffe and Szegedy, 2015]: mini-batch
              size for labeled and mixed set

Ladder        number of hidden layers (here: 5), number of hidden units per layer, noise     ≥ 18
              level, denoising cost multipliers λ^(l) for each layer; ADAM [Kingma
              and Ba, 2015]: learning rate α, ε_ADAM, iterations until annealing phase,
              linear decay rate; batch normalization [Ioffe and Szegedy, 2015]: mini-batch
              size


Table D.1: Tunable hyperparameters of the algorithms compared on the MNIST data set.